Information Theory and Coding by Example
Information Theory and Coding by Example
This fundamental monograph introduces both the probabilistic and the algebraic
aspects of information theory and coding. It has evolved from the authors years
of experience teaching at the undergraduate level, including several Cambridge
Mathematical Tripos courses. The book provides relevant background material, a
wide range of worked examples and clear solutions to problems from real exam
papers. It is a valuable teaching aid for undergraduate and graduate students, or for
researchers and engineers who want to grasp the basic principles.
Mark Kelbert is a Reader in Statistics in the Department of Mathematics at
Swansea University. For many years he has also been associated with the Moscow
Institute of Information Transmission Problems and the International Institute of
Earthquake Prediction Theory and Mathematical Geophysics (Moscow).
Yuri Suhov is a Professor of Applied Probability in the Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge (Emeritus). He
is also affiliated to the University of Sao Paulo in Brazil and to the Moscow Institute
of Information Transmission Problems.
INFORMATION THEORY
A N D CODING BY EXAMPLE
M A R K K E L B E RT
Swansea University, and Universidade de Sao Paulo
Y U R I S U H OV
University of Cambridge, and Universidade de Sao Paulo
Contents
Preface
1
page vii
1
1
18
41
59
86
95
213
243
269
269
291
300
313
328
340
144
144
162
184
199
vi
Contents
366
366
397
409
436
453
480
Bibliography
Index
501
509
Preface
This book is partially based on the material covered in several Cambridge Mathematical Tripos courses: the third-year undergraduate courses Information Theory (which existed and evolved over the last four decades under slightly varied
titles) and Coding and Cryptography (a much younger and simplified course avoiding cumbersome technicalities), and a number of more advanced Part III courses
(Part III is a Cambridge equivalent to an MSc in Mathematics). The presentation
revolves, essentially, around the following core concepts: (a) the entropy of a probability distribution as a measure of uncertainty (and the entropy rate of a random
process as a measure of variability of its sample trajectories), and (b) coding as a
means to measure and use redundancy in information generated by the process.
Thus, the contents of this book includes a more or less standard package of
information-theoretical material which can be found nowadays in courses taught
across the world, mainly at Computer Science and Electrical Engineering Departments and sometimes at Probability and/or Statistics Departments. What makes this
book different is, first of all, a wide range of examples (a pattern that we followed
from the onset of the series of textbooks Probability and Statistics by Example
by the present authors, published by Cambridge University Press). Most of these
examples are of a particular level adopted in Cambridge Mathematical Tripos exams. Therefore, our readers can make their own judgement about what level they
have reached or want to reach.
The second difference between this book and the majority of other books
on information theory or coding theory is that it covers both possible directions: probabilistic and algebraic. Typically, these lines of inquiry are presented
in different monographs, textbooks and courses, often by people who work in
different departments. It helped that the present authors had a long-time association with the Institute for Information Transmission Problems, a section of the
Russian Academy of Sciences, Moscow, where the tradition of embracing a broad
spectrum of problems was strongly encouraged. It suffices to list, among others,
vii
viii
Preface
Preface
ix
was active in this area more than 40 years ago. [Although on several advanced
topics Shannon, probably, could have thought, re-phrasing Einsteins words: Since
mathematicians have invaded the theory of communication, I do not understand it
myself anymore.]
During the years that passed after Shannons inceptions and inventions, mathematics changed drastically, and so did electrical engineering, let alone computer
science. Who could have foreseen such a development back in the 1940s and 1950s,
as the great rivalry between Shannons information-theoretical and Wieners cybernetical approaches was emerging? In fact, the latter promised huge (even fantastic)
benefits for the whole of humanity while the former only asserted that a modest goal of correcting transmission errors could be achieved within certain limits.
Wieners book [171] captivated the minds of 1950s and 1960s thinkers in practically all domains of intellectual activity. In particular, cybernetics became a serious
political issue in the Soviet Union and its satellite countries: first it was declared
a bourgeois anti-scientific theory, then it was over-enthusiastically embraced. [A
quotation from a 1953 critical review of cybernetics in a leading Soviet ideology
journal Problems of Philosophy reads: Imperialists are unable to resolve the controversies destroying the capitalist society. They cant prevent the imminent economical crisis. And so they try to find a solution not only in the frenzied arms race
but also in ideological warfare. In their profound despair they resort to the help of
pseudo-sciences that give them some glimmer of hope to prolong their survival.
The 1954 edition of the Soviet Concise Dictionary of Philosophy printed in hundreds of thousands of copies defined cybernetics as a reactionary pseudo-science
which appeared in the USA after World War II and later spread across other capitalist countries: a kind of modern mechanicism. However, under pressure from
top Soviet physicists who gained authority after successes of the Soviet nuclear
programme, the same journal, Problems of Philosophy, had to print in 1955 an article proclaiming positive views on cybernetics. The authors of this article included
Alexei Lyapunov and Sergei Sobolev, prominent Soviet mathematicians.]
Curiously, as was discovered in a recent biography on Wiener [35], there exist
secret [US] government documents that show how the FBI and the CIA pursued
Wiener at the height of the Cold War to thwart his social activism and the growing
influence of cybernetics at home and abroad. Interesting comparisons can be found
in [65].
However, history went its own way. As Freeman Dyson put it in his review [41]
of [35]: [Shannons theory] was mathematically elegant, clear, and easy to apply
to practical problems of communication. It was far more user-friendly than cybernetics. It became the basis of a new discipline called information theory . . . [In
modern times] electronic engineers learned information theory, the gospel according to Shannon, as part of their basic training, and cybernetics was forgotten.
Preface
Not quite forgotten, however: in the former Soviet Union there still exist at
least seven functioning institutes or departments named after cybernetics: two in
Moscow and two in Minsk, and one in each of Tallinn, Tbilisi, Tashkent and Kiev
(the latter being a renowned centre of computer science in the whole of the former USSR). In the UK there are at least four departments, at the Universities of
Bolton, Bradford, Hull and Reading, not counting various associations and societies. Across the world, cybernetics-related societies seem to flourish, displaying
an assortment of names, from concise ones such as the Institute of the Method
(Switzerland) or the Cybernetics Academy (Italy) to the Argentinian Association of the General Theory of Systems and Cybernetics, Buenos Aires. And we
were delighted to discover the existence of the Cambridge Cybernetics Society
(Belmont, CA, USA). By contrast, information theory figures only in a handful of
institutions names. Apparently, the old Shannon vs. Wiener dispute may not be
over yet.
In any case, Wieners personal reputation in mathematics remains rock solid:
it suffices to name a few gems such as the PaleyWiener theorem (created on
Wieners numerous visits to Cambridge), the WienerHopf method and, of course,
the Wiener process, particularly close to our hearts, to understand his true role in
scientific research and applications. However, existing recollections of this giant of
science depict an image of a complex and often troubled personality. (The title of
the biography [35] is quite revealing but such views are disputed, e.g., in the review
[107]. In this book we attempt to adopt a more tempered tone from the chapter on
Wiener in [75], pp. 386391.) On the other hand, available accounts of Shannons
life (as well as other fathers of information and coding theory, notably, Richard
Hamming) give a consistent picture of a quiet, intelligent and humorous person.
It is our hope that this fact will not present a hindrance for writing Shannons
biographies and that in future we will see as many books on Shannon as we see on
Wiener.
As was said before, the purpose of this book is twofold: to provide a synthetic
introduction both to probabilistic and algebraic aspects of the theory supported by
a significant number of problems and examples, and to discuss a number of topics
rarely presented in most mainstream books. Chapters 13 give an introduction into
the basics of information theory and coding with some discussion spilling over to
more modern topics. We concentrate on typical problems and examples [many of
them originated in Cambridge courses] more than on providing a detailed presentation of the theory behind them. Chapter 4 gives a brief introduction into a variety
of topics from information theory. Here the presentation is more concise and some
important results are given without proofs.
Because the large part of the text stemmed from lecture notes and various solutions to class and exam problems, there are inevitable repetitions, multitudes of
Preface
xi
1
Essentials of Information Theory
(1.1.2)
(1.1.3a)
j=1
(1.1.3b)
j=1
where (u) = P(U1 = u), u I, are the initial probabilities and P(u, u ) = P(U j+1 =
u |U j = u), u, u I, are transition probabilities. A Markov source is called stationary if P(U j = u) = (u), j 1, i.e. = { (u), u = 1, . . . , m} is an invariant
row-vector for matrix P = {P(u, v)}: (u)P(u, v) = (v), v I, or, shortly,
uI
P = .
(c) A degenerated example of a Markov source is where a source emits repeated
symbols. Here,
P(U1 = U2 = = Uk = u) = p(u), u I,
P(Uk = Uk ) = 0, 1 k < k ,
(1.1.3c)
Strings f (u) that are images, under f , of symbols u I are called codewords
(in code f ). A code has (constant) length N if the value s (the length of a codeword) equals N for all codewords. A message u(n) = u1 u2 . . . un is represented as a
concatenation of codewords
f (u(n) ) = f (u1 ) f (u2 ) . . . f (un );
it is again a string from J .
Definition 1.1.3 We say that a code is lossless if u = u implies that f (u) = f (u ).
(That is, the map f : I J is one-to-one.) A code is called decipherable if any
string from J is the image of at most one message. A string x is a prefix in another
string y if y = xz, i.e. y may be represented as a result of a concatenation of x and z.
A code is prefix-free if no codeword is a prefix in any other codeword (e.g. a code
of constant length is prefix-free).
A prefix-free code is decipherable, but not vice versa:
Example 1.1.4 A code with three source letters 1, 2, 3 and the binary encoder
alphabet J = {0, 1} given by
f (1) = 0, f (2) = 01, f (3) = 011
is decipherable, but not prefix-free.
Theorem 1.1.5 (The Kraft inequality) Given positive integers s1 , . . . , sm , there
exists a decipherable code f : I J , with codewords of lengths s1 , . . . , sm , iff
m
qs
1.
(1.1.4)
i=1
Furthermore, under condition (1.1.4) there exists a prefix-free code with codewords
of lengths s1 , . . . , sm .
Proof (I) Sufficiency. Let (1.1.4) hold. Our goal is to construct a prefix-free code
with codewords of lengths s1 , . . . , sm . Rewrite (1.1.4) as
s
nl ql 1,
l=1
(1.1.5)
or
s1
ns qs 1 nl ql ,
l=1
(1.1.6a)
(1.1.6b)
(1.1.6.s1)
(1.1.6.s)
Observe that actually either ni+1 = 0 or ni is less than the RHS of the inequality,
for all i = 1, . . . , s 1 (by definition, ns 1 so that for i = s 1 the second possibility occurs). We can perform the following construction. First choose n1 words
of length 1, using distinct symbols from J: this is possible in view of (1.1.6.s).
It leaves (q n1 ) symbols unused; we can form (q n1 )q words of length 2 by
appending a symbol to each. Choose n2 codewords from these: we can do so in
view of (1.1.6.s1). We still have q2 n1 q n2 words unused: form n3 codewords,
etc. In the course of the construction, no new word contains a previous codeword
as a prefix. Hence, the code constructed is prefix-free.
(II) Necessity. Suppose there exists a decipherable code in J with codeword
lengths s1 , . . . , sm . Set s = max si and observe that for any positive integer r
rs
r
s
q 1 + + qsm = bl ql
l=1
where bl is the number of ways r codewords can be put together to form a string of
length l.
sm
++q
1
= exp (log r + log s) .
r
1/r 1/r
Leon G. Kraft introduced inequality (1.1.4) in his MIT PhD thesis in 1949.
One of the principal aims of the theory is to find the best (that is, the shortest)
decipherable (or prefix-free) code. We now adopt a probabilistic point of view and
assume that symbol u I is emitted by a source with probability p(u):
P(Uk = u) = p(u).
[At this point, there is no need to specify a joint probability of more than one
subsequently emitted symbol.]
Recall, given a code f : I J , we encode a letter i I by a prescribed codeword f (i) = x1 . . . xs(i) of length s(i). For a random symbol, the generated codeword
becomes a random string from J . When f is lossless, the probability of generating
a given string as a codeword for a symbol is precisely p(i) if the string coincides
with f (i) and 0 if there is no letter i I with this property. If f is not one-to-one,
the probability of a string equals the sum of terms p(i) for which the codeword f (i)
equals this string. Then the length of a codeword becomes a random variable, S,
with the probability distribution
P(S = s) =
1(s(i) = s)p(i).
(1.1.7)
1im
We are looking for a decipherable code that minimises the expected word-length:
ES =
sP(S = s) = s(i)p(i).
s1
i=1
(1.1.8)
Theorem 1.1.7
lows:
The optimal value for problem (1.1.8) is lower-bounded as folmin ES hq (p(1), . . . , p(m)),
(1.1.9)
(1.1.10)
where
i
< 0,
z = 0, and
L
= p(i) + qs(i) ln q = 0,
s(i)
whence
p(i)
= qs(i) ,
ln q
p(i)/( ln q) = 1,
i.e. ln q = 1.
Hence,
s(i) = logq p(i),
1 i m,
is the (unique) optimiser for the relaxed problem, giving the value hq from (1.1.10).
The relaxed problem is solved on a larger set of variables s(i); hence, its minimal
value does not exceed that in the original one.
Remark 1.1.8 The quantity hq defined in (1.1.10) plays a central role in the
whole of information theory. It is called the q-ary entropy of the probability distribution (p(x), x I) and will emerge in a great number of situations. Here we note
that the dependence on q is captured in the formula
hq (p(1), . . . , p(m)) =
1
h2 (p(1), . . . , p(m))
log q
(1.1.11)
Worked Example 1.1.9 (a) Give an example of a lossless code with alphabet
Jq which does not satisfy the Kraft inequality. Give an example of a lossless code
with the expected code-length strictly less than hq (X).
(b) Show that the Kraft sum qs(i) associated with a lossless code may be
i
p(1) = p(2) = 1/3 the expected codeword-length Es(X) = 4/3 < h(X) = log 3 =
1.585.
(b) Assume that the alphabet size m = I = 2(2L 1) for some positive
integer L. Consider the lossless code assigning to the letters x I the codewords
0, 1, 00, 01, 10, 11, 000, . . ., with the maximum codeword-length L. The Kraft sum is
2s(x) =
xI
2s(x) =
lL x:s(x)=l
2l 2l = L,
lL
(1.1.12)
where hq = p(i) logq p(i) is the q-ary entropy of the source; see (1.1.10).
i
Proof The LHS inequality is established in (1.1.9). For the RHS inequality, let
s(i) be a positive integer such that
qs(i) p(i) < qs(i)+1 .
The non-strict bound here implies qs(i) p(i) = 1, i.e. the Kraft inequality.
i
Hence, there exists a decipherable code with codeword-lengths s(1), . . . , s(m). The
strict bound implies
s(i) <
log p(i)
+ 1,
log q
and thus
ES <
log q
+ p(i) =
i
h
+ 1.
log q
on average, at least k binary digits for decipherable encoding. Using a term bit for
a unit of entropy, we say that on average the encoding requires at least k bits.
Moreover, the NLCT leads to a ShannonFano encoding procedure: we fix positive integer codeword-lengths s(1), . . . , s(m) such that qs(i) p(i) < qs(i)+1 , or,
equivalently,
(1.1.13)
logq p(i) s(i) < logq p(i) + 1; that is, s(i) = logq p(i) .
Then construct a prefix-free code, from the shortest s(i) upwards, ensuring that
the previous codewords are not prefixes. The Kraft inequality guarantees enough
room. The obtained code may not be optimal but has the mean codeword-length
satisfying the same inequalities (1.1.13) as an optimal code.
Optimality is achieved by Huffmans encoding fmH : Im Jq . We first discuss
it for binary encodings, when q = 2 (i.e. J = {0, 1}). The algorithm constructs a
binary tree, as follows.
(i) First, order the letters i I so that p(1) p(2) p(m).
(ii) Assign symbol 0 to letter m 1 and 1 to letter m.
(iii) Construct a reduced alphabet Im1 = {1, . . . , m 2, (m 1, m)}, with probabilities
p(1), . . . , p(m 2), p(m 1) + p(m).
Repeat steps (i) and (ii) with the reduced alphabet, etc. We obtain a binary tree. For
an example of Huffmans encoding for m = 7 see Figure 1.1.
The number of branches we must pass through in order to reach a root i of the
tree equals s(i). The tree structure, together with the identification of the roots
as source letters, guarantees that encoding is prefix-free. The optimality of binary
Huffman encoding follows from the following two simple lemmas.
10
m= 7
i
pi
f(i)
0
100
101
110
1110
11110
1
3
3
3
4
5
7 .025 11111
1 .5
2 .15
3 .15
4 .1
5 .05
6 .025
1.0
si
.5
.15
.025
Figure 1.1
Lemma 1.1.12 Any optimal prefix-free binary code has the codeword-lengths
reverse-ordered versus probabilities:
p(i) p(i ) implies
s(i) s(i ).
(1.1.14)
Proof If not, we can form a new code, by swapping the codewords for i and i .
This shortens the expected codeword-length and preserves the prefix-free property.
Lemma 1.1.13 In any optimal prefix-free binary code there exist, among the
codewords of maximum length, precisely two agreeing in all but the last digit.
Proof If not, then either (i) there exists a single codeword of maximum length,
or (ii) there exist two or more codewords of maximum length, and they all differ
before the last digit. In both cases we can drop the last digit from some word of
maximum length, without affecting the prefix-free property.
Theorem 1.1.14
codes.
Proof The proof proceeds with induction in m. For m = 2, the Huffman code f2H
has f2H (1) = 0, f2H (2) = 1, or vice versa, and is optimal. Assume the Huffman code
H is optimal for I
fm1
m1 , whatever the probability distribution. Suppose further that
11
the Huffman code fmH is not optimal for Im for some probability distribution. That
is, there is another prefix-free code, fm , for Im with a shorter expected word-length:
H
< ESm
.
ESm
(1.1.15)
H and ES ,
and ESm1
are the same as the corresponding contributions to ESm
m
H
respectively. Therefore, fm1 is better than fm1 : ESm1 < ESm1 , which contradicts the assumption.
In view of Theorem 1.1.14, we obtain
Corollary 1.1.15
codes.
The generalisation of the Huffman procedure to q-ary codes (with the code
alphabet Jq = {0, 1, . . . , q 1}) is straightforward: instead of merging two symbols, m 1, m Im , having lowest probabilities, you merge q of them (again with
the smallest probabilities), repeating the above argument. In fact, Huffmans original 1952 paper was written for a general encoding alphabet. There are numerous
modifications of the Huffman code covering unequal coding costs (where some of
the encoding digits j Jq are more expensive than others) and other factors; we
will not discuss them in this book.
12
1
p (1) < 1/
3
1
p (b)
_ p (1) < 2/
3 p (1) + p (b)
p (b)
(a)
p (b)
p (4)
(b)
Figure 1.2
Solution (a) Two cases are possible: the letter 1 either was, or was not merged with
other letters before two last steps in constructing a Huffman code. In the first case,
s(1) 2. Otherwise we have symbols 1, b and b , with
p(1) < 1/3,
Then letter 1 is to be merged, at the last but one step, with one of b, b , and hence
s(1) 2. Indeed, suppose that at least one codeword has length 1, and this codeword is assigned to letter 1 with p(1) < 1/3. Hence, the top of the Huffman tree is
as in Figure 1.2(a) with 0 p(b), p(b ) 1 p(1) and p(b) + p(b ) = 1 p(1).
min p(b), p(b ) . Hence, Figure 1.2(a) is impossible, and letter 1 has codewordlength 2.
The bound is sharp as both codes
{0, 01, 110, 111} and {00, 01, 10, 11}
are binary Huffman codes, e.g. for a probability distribution 1/3, 1/3, 1/4, 1/12.
(b) Now let p(1) > 2/5 and assume that letter 1 has a codeword-length s(1) 2 in
a Huffman code. Thus, letter 1 was merged with other letters before the last step.
That is, at a certain stage, we had symbols 1, b and b say, with
(A)
(B)
(C)
(D)
13
Indeed, if, say, p(b) < 1/2p(b ) then b should be selected instead of p(3) or p(4)
on the previous step when p(b ) was formed. By virtue of (D), p(b) 1/5 which
makes (A)+(C) impossible.
A piece of the Huffman tree over p(1) is then as in Figure 1.2(b), with p(3) +
p(4) = p(b ) and p(1) + p(b ) + p(b) 1. Write
p(1) = 2/5 + , p(b ) = 2/5 + + , p(b) = 2/5 + + ,
with > 0, , 0. Then
p(1) + p(b ) + p(b) = 6/5 + 3 + 2 1, and 1/5 + 3 + 2 .
This yields
p(b) 1/5 2 < 1/5.
However, since
probability p(b) should be merged with min p(3), p(4) , i.e. diagram (b) is
impossible. Hence, the letter 1 has codeword-length s(1) = 1.
Worked Example 1.1.17 Suppose that letters i1 , . . . , i5 are emitted with probabilities 0.45, 0.25, 0.2, 0.05, 0.05. Compute the expected word-length for Shannon
Fano and Huffman coding. Illustrate both methods by finding decipherable binary
codings in each case.
Solution In this case q = 2, and
ShannonFano:
with E codeword-length) = .9 + .5 + .6 + .25 + .25 = 2.5, and
Huffman:
pi codeword
.45
1
.25
01
.2
000
.05
0010
.05
0011
with E codeword-length) = 0.45 + 0.5 + 0.6 + 0.2 + 0.2 = 1.95.
14
Worked Example 1.1.18 A ShannonFano code is in general not optimal. However, it is not much longer than Huffmans. Prove that, if SSF is the Shannon
Fano codeword-length, then for any r = 1, 2, . . . and any decipherable code f with
codeword-length S ,
P S SSF r q1r .
Solution Write
P(S SSF r) =
p(i).
iI : s (i)sSF (i)r
p(i)
iI : s (i)sSF (i)r
p(i)
p(i)
p(i)
(i)+1r
iI : p(i)qs
(i)+1r
qs
iI
1r
=q
qs (i)
iI
1r
hq =
(1.1.16)
i1 ,...,in
(n)
hq
1
hq
en
+ .
n
n
n
(n)
We see that, for large n, en hq n.
(1.1.17)
15
Example 1.1.19 For a Bernoulli source emitting letter i with probability p(i) (cf.
Example 1.1.2), equation (1.1.16) yields
(n)
hq = p(i1 ) p(in ) logq p(i1 ) p(in )
i1 ,...,in
n
(1.1.18)
j=1 i1 ,...,in
where hq = p(i) logq p(i). Here, en hq . Thus, for n large, the minimal
expected codeword-length per source letter, in a segmented code, eventually attains the lower bound in (1.1.13), and hence does not exceed min ES, the minimal
expected codeword-length for letter-by-letter encodings. This phenomenon is much
more striking in the situation where the subsequent source letters are dependent. In
(n)
many cases hq n hq , i.e. en hq . This is the gist of data compression.
Therefore, statistics of long strings becomes an important property of a source.
Nominally, the strings u(n) = u1 . . . un of length n fill the Cartesian power I n ; the
total number of such strings is mn , and to encode them all we need mn = 2n log m
distinct codewords. If the codewords have a fixed length (which guarantees the
prefix-free property), this length is between n log m, and n log m, and the rate
of encoding, for large n, is log m bits/source letter. But if some strings are rare,
we can disregard them, reducing the number of codewords used. This leads to the
following definitions.
Definition 1.1.20 A source is said to be (reliably) encodable at rate R > 0 if, for
any n, we can find a set An I n such that
An 2nR
and
lim P(U(n) An ) = 1.
(1.1.19)
In other words, we can encode messages at rate R with a negligible error for long
source strings.
Definition 1.1.21 The information rate H of a given source is the infimum of the
reliable encoding rates:
H = inf[R : R is reliable].
Theorem 1.1.22
(1.1.20)
(1.1.21)
16
E
T
A
I
N
O
S
H
R
12000 9000 8000 8000 8000 8000 8000 6400 6200
D
L
U
C
M
F
W
Y
G
B
V
K
Q
J
X
Z
1700 1600 1200 800 500 400 400 200
(b) A similar idea was applied to the decimal and binary decomposition of a
given number. For example, take number . If the information rate for its binary
17
ln x
1
x
Figure 1.3
N 1nN n
or disprove that is a transcendental number, and transcendental numbers form
a set of probability one under the Bernoulli source of subsequent digits.) As the
results of numerical experiments show, for the number of digits N 500, 000 all
the above-mentioned numbers display the same pattern of behaviour as a completely random number; see [26]. In Section 1.3 we will calculate the information
rates of Bernoulli and Markov sources.
We conclude this section with the following simple but fundamental fact.
Theorem 1.1.24 (The Gibbs inequality: cf. PSE II, p. 421) Let {p(i)} and {p (i)}
be two probability distributions (on a finite or countable set I ). Then, for any b > 1,
p(i) logb
i
p (i)
0, i.e. p(i) logb p(i) p(i) logb p (i),
p(i)
i
i
The bound
logb x
x1
ln b
(1.1.22)
18
holds for each x > 0, with equality iff x = 1. Setting I = {i : p(i) > 0}, we have
1
p (i)
p (i)
p (i)
= p(i) logb
1
p(i)
p(i) logb
p(i)
p(i)
ln biI
i
iI
p(i)
1
1
=
p (i) p(i) =
p (i) 1 0.
ln b iI
ln b iI
iI
For equality we need: (a) p (i) = 1, i.e. p (i) = 0 when p(i) = 0; and (b)
p (i)/p(i) = 1 for i I .
iI
[In view of the adopted equality 0 log 0 = 0, the sum may be reduced to those xi
for which pX (xi ) > 0.]
Sometimes an alternative view is useful: i(A) represents the amount of information needed to specify event A and h(X) gives the expected amount of information
required to specify a random variable X.
Clearly, the entropy h(X) depends on the probability distribution, but not on
the values x1 , . . . , xm : h(X) = h(p1 , . . . , pm ). For m = 2 (a two-point probability
distribution), it is convenient to consider the function (p)(= 2 (p)) of a single
variable p [0, 1]:
(1.2.2a)
19
0.0
0.2
0.4
h(p,1p)
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Figure 1.4
Entropy for p, q, p+q in [0,1]
1.5
h( p,q,1pq)
1.0
0.5
0.8
0.6
q
0.4
0.2
0.0
0.0
0.0
0.2
0.4
0.6
0.8
Figure 1.5
The graph of (p) is plotted in Figure 1.4. Observe that the graph is concave as
d2
(p) = log e [p(1 p)] < 0. See Figure 1.4.
2
dp
The graph of the entropy of a three-point distribution
20
(1.2.2b)
Here, pX,Y (i, j) is the joint probability P(X = xi ,Y = y j ) and pX|Y (xi |y j ) the conditional probability P(X = xi |Y = y j ). Clearly, (1.2.3) and (1.2.4) imply
h(X|Y ) = h(X,Y ) h(Y ).
(1.2.5)
(1.2.7)
21
the LHS equality occurring iff X takes a single value, and the RHS equality
occurring iff X takes m values with equal probabilities.
(b) The joint entropy obeys
h(X,Y ) h(X) + h(Y ),
(1.2.8)
with equality iff X and Y are independent, i.e. P(X = x,Y = y) = P(X = x)
P(Y = y) for all x, y I .
(c) The relative entropy is always non-negative:
h(X||Y ) 0,
(1.2.9)
i2
= h(X) + h(Y ).
We used here the identities pX,Y (i1 , i2 ) = pX (i1 ), pX,Y (i1 , i2 ) = pY (i2 ).
i2
i1
(1.2.10)
22
Show that:
(bi) when f E( f ) then the maximising probability distribution is uniform on K , with P(Z = k) = 1/( K), k K ;
(bii) when f < E( f ) and f is not constant then the maximising probability distribution has
P(Z = k) = pk = e f (k)
e f (i) ,
k K,
(1.2.11)
pk f (k) = .
(1.2.12)
Moreover, suppose that Z takes countably many values, but f 0 and for a
given there exists a < 0 such that e f (i) < and pk f (k) = where pk
i
(1.2.13)
23
Eq f = qk f (k) . Next, observe that the mean value (1.2.12) calculated for the
k
f
(k)
[ f (k)] e
f (k)e
d
k
2
2
=
k
2 = E[ f (Z)] [E f (Z)]
f
(i)
d
e
e f (i)
i
i
is positive (it yields the variance of the random variable f (Z)); for a non-constant
f the RHS is actually non-negative. Therefore, for non-constant f (i.e. with
f < E( f ) < f ), for all from the interval [ f , f ] there exists exactly one probability distribution of form (1.2.11) satisfying (1.2.12), and for f < E( f ) the
corresponding ( ) is < 0.
Next, we use the fact that the KullbackLeibler distance D(q||p ) (cf. (1.2.6))
satisfies D(q||p ) = qk log (qk /pk ) 0 (Gibbs inequality) and that qk f (k)
k
qk log pk
qk log e f (i)
k
=
k
pk
= pk log pk = h(p ).
k
For part (biii): the above argument still works for an infinite countable set K
provided that the value ( ) determined from (1.2.12) is < 0.
(c) By the Gibbs inequality hY (X) 0. Next, we use part (b) by taking f (k) = k,
= and = ln q. The maximum-entropy distribution can be written as pj =
(1 p)p j , j = 0, 1, 2, . . ., with kpk = , or = p/(1 p). The entropy of this
k
distribution equals
h(p ) = (1 p)p j log (1 p)p j
j
p
log p log(1 p) = ( + 1) log( + 1) log ,
1 p
24
Alternatively:
0 hY (X) = p(i) log
i
p(i)
(1 p)pi
ip(i)
i
log
= ( + 1) log( + 1) log .
+1
+1
The RHS is the entropy h(Y ) of the geometric random variable Y . Equality holds
iff X Y , i.e. X is geometric.
A simple but instructive corollary of the Gibbs inequality is
Lemma 1.2.5
(1.2.15)
25
log x
1/
2
Figure 1.6
p log p (1 p ) log(1 p )
(i) h
(ii) h log p ;
(iii) h 2(1 p ).
(p );
Solution Part (i) follows from the pooling inequality, and (ii) holds as
h pi log p = log p .
i
(1.2.16)
(1.2.17)
26
pi log pi
2im
p2
pm
= h(p1 , 1 p1 ) + (1 p1 )h
,...,
1 p1
1 p1
;
in the RHS the first term is ( ) and the second one does not exceed log(m 1).
Definition 1.2.9 Given random variables X, Y , Z, we say that X and Y are conditionally independent given Z if, for all x and y and for all z with P(Z = z) > 0,
P(X = x,Y = y|Z = z) = P(X = x|Z = z)P(Y = y|Z = z).
(1.2.18)
(1.2.19)
the first equality occurring iff X is a function of Y and the second equality holding
iff X and Y are independent.
(b) For all random variables X , Y , Z ,
h(X|Y, Z) h(X|Y ) h(X| (Y )),
(1.2.20)
the first equality occurring iff X and Z are conditionally independent given Y and
the second equality holding iff X and Z are conditionally independent given (Y ).
Proof (a) The LHS bound in (1.2.19) follows from definition (1.2.4) (since
h(X|Y ) is a sum of non-negative terms). The RHS bound follows from representation (1.2.5) and bound (1.2.8). The LHS quality in (1.2.19) is equivalent
to the equation h(X,Y ) = h(Y ) or h(X,Y ) = h( (X,Y )) with (X,Y ) = Y . In
view of Theorem 1.2.6, this occurs iff, with probability 1, the map (X,Y ) Y
is invertible, i.e. X is a function of Y . The RHS equality in (1.2.19) occurs iff
h(X,Y ) = h(X) + h(Y ), i.e. X and Y are independent.
(b) For the lower bound, use a formula analogous to (1.2.5):
h(X|Y, Z) = h(X, Z|Y ) h(Z|Y )
(1.2.21)
(1.2.22)
27
with equality iff X and Z are conditionally independent given Y . For the RHS
bound, use:
(i) a formula that is a particular case of (1.2.21): h(X|Y, (Y )) = h(X,Y | (Y ))
h(Y | (Y )), together with the remark that h(X|Y, (Y )) = h(X|Y );
(ii) an inequality which is a particular case of (1.2.22): h(X,Y | (Y ))
h(X| (Y )) + h(Y | (Y )), with equality iff X and Y are conditionally independent
given (Y ).
Theorems 1.2.8 above and 1.2.11 below show how the entropy h(X) and conditional entropy h(X|Y ) are controlled when X is nearly a constant (respectively,
nearly a function of Y ).
Theorem 1.2.11 (The generalised Fano inequality) For a pair of random variables, X and Y taking values x1 , . . . , xm and y1 , . . . , ym , if
m
P(X = x j ,Y = y j ) = 1 ,
(1.2.23)
(1.2.24)
j=1
then
where ( ) is defined in (1.2.3).
Proof
pY (y j ) j = P(X = x j ,Y = y j ) = .
j
(1.2.25)
If the random variable X takes countably many values {x1 , x2 , . . .}, the above
definitions may be repeated, as well as most of the statements; notable exceptions
are the RHS bound in (1.2.7) and inequalities (1.2.17) and (1.2.24).
Many properties of entropy listed so far are extended to the case of random
strings.
Theorem 1.2.12
(Y1 , . . . ,Yn ),
28
obeys
n
i=1
i=1
(1.2.26)
x(n) ,y(n)
satisfies
n
i=1
i=1
(1.2.27)
with the LHS equality holding iff X1 , . . . , Xn are conditionally independent, given
Y(n) , and the RHS equality holding iff, for each i = 1, . . . , n, Xi and {Yr : 1 r
n, r = i} are conditionally independent, given Yi .
Proof The proof repeats the arguments used previously in the scalar case.
Definition 1.2.13 The mutual information or mutual entropy, I(X : Y ), between
X and Y is defined as
I(X : Y ) := pX,Y (x, y) log
x,y
pX,Y (X,Y )
pX,Y (x, y)
= E log
pX (x)pY (y)
pX (X)pY (Y )
(1.2.28)
(1.2.29)
the first equality occurring iff X and (Y ) are independent, and the second iff X
and Y are conditionally independent, given (Y ).
29
1 p
1
, 0<z< ,
1 zp
p
Y =
K(1 p) + p
p
p
=
.
=K+
1 p
1 p
1 p
This yields
p =
K(1 p) + p
1 p
, E(zY ) =
,
1 + K(1 p)
1 + K(1 p) z(p + K(1 p))
and
E(zX ) =
1 zp
.
1 + K(1 p) z(p + K(1 p))
(1.2.30)
1 pX
,
1 zpX
p + K(1 p)
p
, 0 =
,
1 + K(1 p)
p + K(1 p)
(1.2.31)
30
i=1
i=1
(1.2.32)
(1.2.33)
i=1
Observe that
n
i=1
i=1
(1.2.34)
I(X : Y j ),
(1.2.35)
j=1
first under the assumption that Y1 , . . . ,Yn are independent random variables,
and then under the assumption that Y1 , . . . ,Yn are conditionally independent
given X .
(c) Prove or disprove by producing a counter-example the inequality
I(X : Y(n) )
I(X : Y j ),
(1.2.36)
j=1
first under the assumption that Y1 , . . . ,Yn are independent random variables,
and then under the assumption that Y1 , . . . ,Yn are conditionally independent
given X .
31
P(X = x, Z = z)
P(X = x)P(Z = z)
(1.2.37)
j=1
I(X : Y j ),
j=1
j=1
j=1
I(X : Y j ),
j=1
32
Recall that a real function f (y) defined on a convex set V Rm is called concave if
f (0 y(0) + 1 y(1) ) 0 f (y(0) ) + 1 f (y(1) )
for any y(0) , y(1) V and 0 , 1 [0, 1] with 0 + 1 = 1. It is called strictly concave
if the equality is attained only when either y(0) = y(1) or 0 1 = 0. We treat h(X) as
a function of variables p = (p1 , . . . , pm ); set V in this case is {y = (y1 , . . . , ym ) Rm :
yi 0, 1 i m, y1 + + ym = 1}.
Theorem 1.2.18
bution.
Proof Let the random variables X (i) have probability distributions p(i) , i = 0, 1,
and assume that the random variable takes values 0 and 1 with probabilities
0 and 1 , respectively, and is independent of X (0) , X (1) . Set X = X () ; then the
inequality h(0 p(0) + 1 p(1) ) 0 h(p(0) ) + 1 h(p(1) ) is equivalent to
h(X) h(X|)
(1.2.38)
(0)
and the RHS equals 0 0 pi +
(0)
(1)
(1 0 )pi = 1 pi ,
i.e. the probability distributions p(0) and p(1) are proportional. Then either they are
equal or 1 = 0, 0 = 1. The assumption 1 > 0 leads to a similar conclusion.
Worked Example 1.2.19
obeys
(X,Y ) = h(X) + h(Y ) 2I(X : Y )
= h(X,Y ) I(X : Y ) = 2h(X,Y ) h(X) h(Y ).
Prove that is symmetric, i.e. (X,Y ) = (Y, X) 0, and satisfies the triangle
inequality, i.e. (X,Y ) + (Y, Z) (X, Z). Show that (X,Y ) = 0 iff X and Y
are functions of each other. Also show that if X and X are functions of each other
then (X,Y ) = (X ,Y ). Hence, may be considered as a metric on the set of
the random variables X , considered up to equivalence: X X iff X and X are
functions of each other.
33
(1.2.39)
(1.2.40)
In fact, for all triples Xi1 , Xi2 , Xi3 as above, in the metric we have that
34
1in
I(X : Zi ) I Z I X : Z .
(1.2.41)
1in
As we show below, bound (1.2.41) holds for any X and Z (without referring to a
Markov property). Indeed, (1.2.41) is equivalent to
nh(X) h(X, Zi ) + h Z h(X) + h Z h X, Z
1in
or
h X, Z h(X)
h(X, Zi ) nh(X)
1in
which in turn is nothing but the inequality h Z|X h(Zi |X).
1in
(b) Show that h(p) p j Pjk log Pjk if P is a stochastic matrix and p is an
j=1 k=1
invariant vector of P: Pp = p.
Solution (a) By concavity of the log-function x log x, for all i , ci 0 such
m
i,k
the Gibbs inequality the RHS h(p). The equality holds iff PT Pp p, i.e. PT P =
I, the unit matrix. This happens iff P is a permutation matrix.
35
(b) The LHS equals h(Un ) for the stationary Markov source (U1 ,U2 , . . .) with equilibrium distribution p, whereas the RHS is h(Un |Un1 ). The general inequality
h(Un |Un1 ) h(Un ) gives the result.
Worked Example 1.2.23 The sequence of random variables {X j : j = 1, 2, . . .}
forms a DTMC with a finite state space.
(a) Quoting standard properties of conditional entropy, show that h(X j |X j1 )
h(X j |X j2 ) and, in the case of a stationary DTMC, h(X j |X j2 ) 2h(X j |X j1 ).
(b) Show that the mutual information I(Xm : Xn ) is non-decreasing in m and nonincreasing in n, 1 m n.
Solution (a) By the Markov property and stationarity
h(X j |X j1 ) = h(X j |X j1 , X j2 )
h(X j |X j2 ) h(X j , X j1 |X j2 )
= h(X j |X j1 , X j2 ) + h(X j1 |X j2 ) = 2h(X j |X j1 ).
(b) Write
I(Xm : Xn ) I(Xm : Xn+1 ) = h(Xm |Xn+1 ) h(Xm |Xn )
= h(Xm |Xn+1 ) h(Xm |Xn , Xn+1 ) (because Xm and
Xn+1 are conditionally independent, given Xn )
which is 0. Thus, I(Xm : Xn ) does not increase with n.
Similarly,
I(Xm1 : Xn ) I(Xm : Xn ) = h(Xn |Xm1 ) h(Xn |Xm , Xm1 ) 0.
Thus, I(Xm : Xn ) does not decrease with m.
Here, no assumption of stationarity has been used. The DTMC may not even be
time-homogeneous (i.e. the transition probabilities may depend not only on i and j
but also on the time of transition).
Worked Example 1.2.24
36
Solution By the Markov property, Xn1 and Xn+1 are conditionally independent,
given Xn . Hence,
h(Xn1 , Xn+1 |Xn ) = h(Xn+1 |Xn ) + h(Xn1 |Xn )
and I(Xn1 : Xn+1 |Xn ) = 0. Also,
I(Xn : Xn+m ) I(Xn : Xn+m+1 )
= h(Xn+m ) h(Xn+m+1 ) h(Xn , Xn+m+1 ) + h(Xn , Xn+m )
= h(Xn |Xn+m+1 ) h(Xn |Xn+m )
= h(Xn |Xn+m+1 ) h(Xn |Xn+m , Xn+m+1 ) 0,
the final equality holding because of the conditional independence and the last
inequality following from (1.2.21).
Worked Example 1.2.25
(1.2.43)
(1.2.44)
37
Solution (a) Using (1.2.42), we obtain for the function F(m) = h 1/m, . . . , 1/m
the following identity:
1 1 1
1 1
1
2
F(m ) = h
,..., , 2 ,..., 2
m m
m m m
m
1 1
1
1
, , . . . , 2 + F(m)
=h
m m2
m
m
..
.
1
m
1
,...,
+ F(m) = 2F(m).
=h
m
m
m
The induction hypothesis is F(mk1 ) = (k 1)F(m). Then
1
1
1
1
1
1
(m ) = h
k1 , . . . , k1 , k , . . . , k
m m
m m
m
m
1
1
1
1
=h
, , . . . , k + F(m)
mk1 mk
m
m
..
.
m
1
1
, . . . , k1 + F(m)
=h
k1
m
m
m
= (k 1)F(m) + F(m) = kF(m).
Now, for given positive integers b > 2 and m, we can find a positive integer n such
that 2n bm 2n+1 , i.e.
n
n 1
log2 b + .
m
m m
By monotonicity of F(m), we obtain nF(2) mF(b) (n + 1)F(2), or
F(b)
n 1
n
+ .
m F(2) m m
F(b)
1
We conclude that log2 b
, and letting m , F(b) = c log b with
F(2)
m
c = F(2).
38
r1
rm
, . . . , pm =
and obtain
r
r
r
rm
rm
r1
r1 1 r2
r1 1
1
,...,
,..., , ,...,
F(r1 )
h
=h
r
r
r r1
r r1 r
r
r
..
.
1
ri
1
,...,
c
log ri
=h
r
r
1im r
ri
ri
ri
log ri = c
log .
= c log r c
r
1im r
1im r
1
1
,...,
F(m) = h
m
m
m1
1
1
1 m1
,
h
,...,
+
,
=h
m m
m
m1
m1
i.e.
1 m1
m1
,
= F(m)
F(m 1).
m m
m
(1.2.45)
39
(1.2.46)
Next, we write
m
mF(m) =
F(k)
k=1
k1
F(k 1)
k
or, equivalently,
m
F(m) m + 1
2
k1
=
k F(k) k F(k 1) .
m
2m m(m + 1) k=1
The quantity in the square brackets is the arithmetic mean of m(m + 1)/2 terms of
a sequence
2
2
F(1), F(2) F(1), F(2) F(1), F(3) F(2), F(3) F(2),
3
3
k1
2
F(k 1), . . . ,
F(3) F(2), . . . , F(k)
3
k
k1
F(k)
F(k 1), . . .
k
that tends to 0. Hence, it goes to 0 and F(m)/m 0. Furthermore,
1
m1
F(m 1) F(m 1) 0,
F(m) F(m 1) = F(m)
m
m
and (1.2.46) holds. Now define
c(m) =
F(m)
,
log m
and prove that c(m) = const. It suffices to prove that c(p) = const for any prime
number p. First, let us prove that a sequence (c(p)) is bounded. Indeed, suppose the
numbers c(p) are not bounded from above. Then, we can find an infinite sequence
of primes p1 , p2 , . . . , pn , . . . such that pn is the minimal prime such that pn > pn1
and c(pn ) > c(pn1 ). By construction, if a prime q < pn then c(q) < c(pn ).
40
F(pn )
(1.2.47)
j=1
p
p
log
= 0, equations (1.2.46) and (1.2.47) imply that
p log p
p1
c(pn ) c(2) 0 which contradicts with the construction of c(p). Hence, c(p) is
bounded from above. Similarly, we check that c(p) is bounded from below. Moreover, the above proof yields that sup p c(p) and inf p c(p) are both attained.
Now assume that c( p) = sup p c(p) > c(2). Given a positive integer m, decompose into prime factors pm 1 = q1 1 . . . qs s with q1 = 2. Arguing as before, we
write the difference F( pm ) F( pm 1) as
Moreover, as lim
F( pm )
log( pm 1) + c( p) log( pm 1) F( pm 1)
log pm
s
pm
F( pm ) pm
+
=
log
j (c( p) c(q j )) log q j
F( pm )
c( pm ) pm
pm
+ (c( p) c(2)).
log
pm log pm
pm 1
1 1
2, 2
= 1 we get c = 1. Finally, as
h(p1 , . . . , pm ) = pi log pi
(1.2.48)
i=1
1.3 Shannons first coding theorem. The entropy rate of a Markov source
41
q1 qn ,
one has
k
i=1
pi qi ,
for all k = 1, . . . , n.
i=1
Then
h(p) h(q) whenever p q.
Solution We write the probability distributions p and q as non-increasing functions
of a discrete argument
p p(1) p(n) 0, q q(1) q(n) 0,
with
p(i) = q(i) = 1.
i
Condition p q means that if p = q then there exist i1 and i2 such that (a) 1 i1
i2 n, (b) q(i1 ) > p(i1 ) p(i2 ) > q(i2 ) and (c) q(i) p(i) for 1 i i1 , q(i) p(i) for
i i2 .
Now apply induction in s, the number of values i = 1, . . . , n for which q(i) = p(i) .
If s = 0 we have p = q and the entropies coincide. Make the induction hypothesis
and then increase s by 1. Take a pair i1 , i2 as above. Increase q(i2 ) and decrease q(i1 )
so that the sum q(i1 ) +q(i2 ) is preserved, until either q(i1 ) reaches p(i1 ) or q(i2 ) reaches
p(i2 ) (see Figure 1.7). Property (c) guarantees that the modified distributions p q.
As the function x (x) = x log x (1 x) log(1 x) strictly increases on
[0, 1/2]. Hence, the entropy of the modified distribution strictly increases. At the
end of this process we diminish s. Then we use our induction hypothesis.
42
i1
i2
Figure 1.7
(1.3.2)
(1.3.4)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
Proof
43
For brevity, omit the upper index (n) in the notation u(n) and U(n) . Set
Bn := {u I n : pn (u) 2nR }
= {u I n : log pn (u) nR}
= {u I n : n (u) R}.
Then
1 P(U Bn ) =
uBn
Thus,
44
In fact,
P 2n(H+ ) pn (U(n) ) 2n(H )
1
(n)
= P H log pn (U ) H +
n
= P |n H| = 1 P |n H| > .
In other words, for all > 0 there exists n0 = n0 ( ) such that, for any n > n0 , the
set I n decomposes into disjoint subsets, n and Tn , with
(i) P U(n) n < ,
(ii) 2n(H+ ) P U(n) = u(n) 2n(H ) for all u(n) Tn .
Pictorially speaking, Tn is a set of typical strings and n is the residual set.
We conclude that, for a source with the asymptotic equipartition property, it is
worthwhile to encode the typical strings with codewords of the same length, and
the rest anyhow. Then we have the effective encoding rate H + o(1) bits/sourceletter, though the source emits log m bits/source-letter.
(b) Observe that
E n =
1
1
pn (u(n) ) log pn (u(n) ) = h(n) .
n u(n) I n
n
(1.3.7)
The simplest example of an information source (and one among the most
instructive) is a Bernoulli source.
Theorem 1.3.7
1.3 Shannons first coding theorem. The entropy rate of a Markov source
Proof
45
i=1
1 n
i . Observe that Ei = j p( j) log p( j) = h and
n i=1
E n = E
1 n
i
n i=1
=
1 n
1 n
Ei = h = h,
n i=1
n i=1
the final equality being in agreement with (1.3.7), since, for the Bernoulli source,
P
h(n) = nh (see (1.1.18)), and hence En = h. We immediately see that n h by
the law of large numbers. So H = h by Theorem 1.3.5 (FCT).
Theorem 1.3.8 (The law of large numbers for IID random variables) For any
sequence of IID random variables 1 , 2 , . . . with finite variance and mean Ei = r,
and for any > 0,
1 n
(1.3.8)
lim P | i r| = 0.
n
n i=1
Proof The proof of Theorem 1.3.8 is based on the famous Chebyshev inequality;
see PSE II, p. 368.
Lemma 1.3.9
Proof
1
E 2 .
2
(1.3.9)
46
This condition means that the DTMC is irreducible and aperiodic. Then (see
PSE II, p. 71), the DTMC has a unique invariant (equilibrium) distribution
(1), . . . , (m):
m
0 (u) 1,
(u) = 1,
(v) =
u=1
(u)P(u, v),
(1.3.10)
u=1
(1.3.11)
for all initial distribution { (u), u I}. Moreover, the convergence in (1.3.11) is
exponentially (geometrically) fast.
Theorem 1.3.10 Assume that condition (1.3.9) holds with r = 1. Then the DTMC
U1 ,U2 , . . . possesses a unique invariant distribution (1.3.10), and for any u, v I and
any initial distribution on I ,
|P(n) (u, v) (v)| (1 )n and |P(Un = v) (v)| (1 )n1 .
(1.3.12)
1u,vm
(1.3.13)
Proof We again use the Shannon FCT to check that n H where H is given by
1
(1.3.13), and n = log pn (U(n) ), cf. (1.3.3b). In other words, condition (1.3.9)
n
implies the asymptotic equipartition property for a Markov source.
The Markov property means that, for all string u(n) = u1 . . . un ,
pn (u(n) ) = (u1 )P(u1 , u2 ) P(un1 , un ),
(1.3.14a)
(1.3.14b)
For a random string, U(n) = U1 . . .Un , the random variable log pn (U(n) ) has a
similar form:
log (U1 ) log P(U1 ,U2 ) log P(Un1 ,Un ).
(1.3.15)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
47
(1.3.16)
and write
n =
n1
1
1 + i+1 .
n
i=1
(1.3.17)
(1.3.18a)
= Pi1 (u)P(u, u ) log P(u, u ),
u,u
i 1.
(1.3.18b)
1 n
Ei = H,
n n
i=1
lim En = lim
n
P
and the convergence n H is again a law of large numbers, for the sequence
(i ):
1 n
(1.3.19)
lim P i H = 0.
n
n i=1
However, the situation here is not as simple as in the case of a Bernoulli source.
There are two difficulties to overcome: (i) Ei equals H only in the limit i ;
(ii) 1 , 2 , . . . are no longer independent. Even worse, they do not form a DTMC,
or even a Markov chain of a higher order. [A sequence U1 ,U2 , . . . is said to form a
DTMC of order k, if, for all n 1,
P(Un+k+1 = u |Un+k = uk , . . . ,Un+1 = u1 , . . .)
= P(Un+k+1 = u |Un+k = uk , . . . ,Un+1 = u1 ).
An obvious remark is that, in a DTMC of order k, the vectors U n =
(Un ,Un+1 , . . . ,Un+k1 ), n 1, form an ordinary DTMC.] In a sense, the memory in a sequence 1 , 2 , . . . is infinitely long. However, it decays exponentially:
the precise meaning of this is provided in Worked Example 1.3.14.
48
(1.3.20)
(i H)
C n,
(1.3.21)
i=1
C
and goes to zero as n .
n 2
(1.3.22)
Solution (Compare with PSE II, p. 72.) First, observe that (1.3.12) implies the
second bound in Theorem 1.3.10 as well as (1.3.10). Indeed, (v) is identified as
the limit
lim P(n) (u, v) = lim P(n1) (u, u)P(u, v) = (u)P(u, v),
(1.3.23)
(u) = 1,
u=1
then (v) = (u)P(n) (u, v) for all n 1. The limit n gives then
u
(1.3.24)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
49
Then
mn+1 (v) = min P(n+1) (u, v) = min
u
Similarly,
Mn+1 (v) = max P(n+1) (u, v) = max
u
Since 0 mn (v) Mn (v) 1, both mn (v) and Mn (v) have the limits
m(v) = lim mn (v) lim Mn (v) = M(v).
n
n u,u
(1.3.25)
then M(v) = m(v) for each v. Furthermore, denoting the common value M(v) =
m(v) by (v), we obtain (1.3.22)
|P(n) (u, v) (v)| Mn (v) mn (v) (1 )n .
To prove (1.3.25), consider a DTMC on I I, with states (u1 , u2 ), and transition
probabilities
P(u1 , v1 )P(u2 , v2 ), if u1 = u2 ,
P (u1 , u2 ), (v1 , v2 ) =
P(u, v), if u1 = u2 = u; v1 = v2 = v,
0, if u1 = u2 and v1 = v2 .
(1.3.26)
It is easy to check that P (u1 , u2 ), (v1 , v2 ) is indeed a transition probability matrix
(of size m2 m2 ): if u1 = u2 = u then
P
(u
,
u
),
(v
,
v
)
= P(u, v) = 1
1 2
1 2
v1 ,v2
whereas if u1 = u2 then
P (u1 , u2 ), (v1 , v2 ) = P(u1 , v1 ) P(u2 , v2 ) = 1
v1 ,v2
v1
v2
50
(the inequalities 0 P (u1 , u2 ), (v1 , v2 ) 1 follow directly from the definition
(1.3.26)).
This is the so-called coupled DTMC on I I; we denote it by (Vn ,Wn ), n 1.
Observe that both components Vn and Wn are DTMCs with transition probabilities
P(u, v). More precisely, the components Vn and Wn move independently, until the
first (random) time when they coincide; we call it the coupling time. After time
the components Vn and Wn stick together and move synchronously, again with
transition probabilities P(u, v).
Suppose we start the coupled chain from a state (u, u ). Then
|P(n) (u, v) P(n) (u , v)|
= |P(Vn = v|V1 = u,W1 = u ) P(Wn = v|V1 = u,W1 = u )|
(because each component of (Vn ,Wn ) moves with the same transition probabilities)
= |P(Vn = v,Wn = v|V1 = u,W1 = u )
P(Vn = v,Wn = v|V1 = u,W1 = u )|
P(Vn = Wn |V1 = u,W1 = u )
= P( > n|V1 = u,W1 = u ).
(1.3.27)
(1.3.28)
2
|E (i H)(i+k H) | H + | log | (1 )k1 .
(1.3.29)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
51
Solution For brevity, we assume i > 1; the case i = 1 requires minor changes.
Returning to the definition of random variables i , i > 1, write
E (i H)(i+k H)
= P(Ui = u,Ui+1 = u ;Ui+k = v,Ui+k+1 = v )
u,u v,v
log P(u, u ) H log P(v, v ) H .
(1.3.30)
i1
P
(u)P(u, u ) log P(u, u ) H
u,u v,v
(1.3.31)
Observe that (1.3.31) in fact vanishes because the sum vanishes due to the defiv,v
nition (1.3.13) of H.
The difference between sums (1.3.30) and (1.3.31) comes from the fact that the
probabilities
P(Ui = u,Ui+1 = u ;Ui+k = v,Ui+k+1 = v )
= Pi1 (u)P(u, u )P(k1) (u , v)P(v, v )
and
Pi1 (u)P(u, u ) (v)P(v, v )
E (i H) = E (i H)2
i=1
1in
+2
E (i H)( j H) .
(1.3.32)
1i< jn
The first sum in (1.3.32) is OK: it contains n terms E(i H)2 each bounded by a
2
constant (say, C may be taken to be H + | log | ). Thus this sum is at most C n.
52
It is the second sum that causes problems: it contains n(n 1) 2 terms. We bound
it as follows:
n
E (i H)( j H) |E (i H)(i+k H) | , (1.3.33)
i=1 k=1
1i< jn
and use (1.3.29) to finish the proof.
Our next theorem shows the role of the (relative) entropy in the asymptotic analysis of probabilities; see PSE I, p. 82.
Theorem 1.3.15 Let 1 , 2 , . . . be a sequence of IID random variables taking
values 0 and 1 with probabilities 1 p and p, respectively, 0 < p < 1. Then, for
any sequence kn of positive integers such that kn and n kn as n ,
P
i = kn
(1.3.34)
i=1
Here, means that the ratio of the left- and right-hand sides tends to 1 as n ,
kn
p (= pn ) denotes the ratio , and D(p||p ) stands for the relative entropy h(X||Y )
n
where X is distributed as i (i.e. it takes values 0 and 1 with probabilities 1 p
and p), while Y takes the same values with probabilities 1 p and p .
Proof Use Stirlings formula (see PSE I, p.72):
n! 2 nnn en .
(1.3.35)
[In fact, this formula admits a more precise form: n! = 2 nnn en+ (n) , where
1
1
< (n) <
, but for our purposes (1.3.35) is enough.] Then the proba12n + 1
12n
bility in the LHS of (1.3.34) is (for brevity, the subscript n in kn is omitted)
1/2
n k
n
nn
nk
p (1 p)
pk (1 p)nk
k
2 k(n k)
kk (n k)nk
1/2
= 2 np (1 p )
exp [k ln k/n (n k) ln (n k)/n + k ln p + (n k) ln(1 p)] .
But the RHS of the last formula coincides with the RHS of (1.3.34).
If p is close to p, we can write
1
1 1
+
D(p||p ) =
(p p)2 + O(|p p|3 ),
2 p 1 p
d
) |
D(p||p
as D(p||p )| p =p =
p =p = 0, and immediately obtain
dp
(1.3.36)
1.3 Shannons first coding theorem. The entropy rate of a Markov source
53
2
(p p) .
P i = kn $
exp
(1.3.37)
2p(1 p)
2 np(1 p)
i=1
Worked Example 1.3.17 At each time unit a device reads the current version
of a string of N characters each of which may be either 0 or 1. It then transmits
the number of characters which are equal to 1. Between each reading the string is
perturbed by changing one of the characters at random (from 0 to 1 or vice versa,
with each character being equally likely to be changed). Determine an expression
for the information rate of this source.
Solution The source is Markov, with the state space
probability matrix
0
1
0
0
1/N
0
(N
1)/N
0
2/N
0
(N 2)/N
0
...
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0 1/N
1
0
1 N1 N
N
j j log j .
N j=1
The distribution of the first symbol is equiprobable. Find the information rate of
the source. Does the result contradict Shannons FCT?
54
How does the answer change if m is odd? How can you use, for m odd, Shannons
FCT to derive the information rate of the above source?
Solution For m even, the DTMC is reducible: there are two communicating classes,
I1 = {0, 2, . . . , m} with m/2 + 1 states, and I2 = {1, 3, . . . , m 1} with m/2 states.
Correspondingly, for any set An of n-strings,
P(An ) = qP1 (An1 ) + (1 q)P2 (An2 ),
(1.3.38)
(1.3.39)
Both DTMCs are irreducible and aperiodic on their communicating classes and
their invariant distributions are uniform:
(1)
2
2
(2)
, i I1 , i = , i I2 .
m+2
m
8
8
and H (2) = log 3
.
3(m + 2)
3m
(1.3.40)
As follows from (1.3.38), the information rate of the whole DTMC equals
% (1)
H = max [H (1) , H (2) ], if 0 < q 1,
Hodd =
(1.3.41)
H (2) , if q = 0.
For 0 < q < 1 Shannons FCT is not applicable:
1
1
P1
P2
H (1) whereas log pn2 (U(n) )
H (2) ,
log pn1 (U(n) )
n
n
i.e. n converges to a non-constant limit. However, if q(1 q) = 0, then (1.3.41)
is reduced to a single line, and Shannons FCT is applicable: n converges to the
corresponding constant H (i) .
If m is odd, again there are two communicating classes, I1 = {0, 2, . . . , m 1}
and I2 = {1, 3, . . . , m}, each of which now contains (m + 1)/2 states. As before,
1.3 Shannons first coding theorem. The entropy rate of a Markov source
55
DTMCs P1 and P2 are irreducible and aperiodic and their invariant distributions
are uniform:
2
2
(1)
(2)
, i I1 , i =
, i I2 .
i =
m+1
m+1
Their common information rate equals
Hodd = log 3
8
,
3(m + 1)
(1.3.42)
which also gives the information rate of the whole DTMC. It agrees with Shannons
FCT, because now
1
P
n = log pn (U(n) ) Hodd .
(1.3.43)
n
Worked Example 1.3.19 Let a be the size of A and b the size of the alphabet B.
Consider a source with letters chosen from an alphabet A + B, with the constraint
that no two letters of A should ever occur consecutively.
(a) Suppose the message follows a DTMC, all characters which are permitted at a
given place being equally likely. Show that this source has information rate
H=
a log b + (a + b) log(a + b)
.
2a + b
(1.3.44)
(b) By solving a recurrence relation, or otherwise, find how many strings of length
n satisfy the constraint that no two letters of A occur consecutively. Suppose
these strings are equally likely and let n . Show that the limiting information rate becomes
b + b2 + 4ab
.
H = log
2
if x, y {1, . . . , a},
0,
P(x, y) = 1/b,
if x {1, . . . , a}, y {a + 1, . . . , a + b},
56
the detailed balance equations (DBEs) (x)P(x, y) = (y)P(y, x) (cf. PSE II, p. 82),
which yields
'
1 (2a + b),
x {1, . . . , a},
(x) =
(a + b) [b(2a + b)], x {a + 1, . . . , a + b}.
The DBEs imply that is invariant: (y) = (x)P(x, y), but not vice versa. Thus,
x
we obtain (1.3.44).
(b) Let Mn denote the number of allowed n-strings, An the number of allowed
n-strings ending with a letter from A, and Bn the number of allowed n-strings
ending with a letter from B. Then
Mn = An + Bn , An+1 = aBn , and Bn+1 = b(An + Bn ),
which yields
Bn+1 = bBn + abBn1 .
The last recursion is solved by
Bn = c+ +n + c n ,
where are the eigenvalues of the matrix
0 ab
,
1 b
i.e.
b b2 + 4ab
,
=
2
and c are constants, c+ > 0. Hence,
n1
n1
Mn = a c
+ c+ +n + c n
+ + + c
n1
n
1
+1
,
= +n c a n + n + c+ a
+
+
+
1
log Mn is represented as the sum
n
n1 n
1
1
.
+1
log + + log c a n + n + c+ a
n
+
+
+
Note that < 1. Thus, the limiting information rate equals
+
and
1
log Mn = log + .
n n
lim
1.3 Shannons first coding theorem. The entropy rate of a Markov source
57
The answers are different since the conditional equidistribution results in a strong
dependence between subsequent letters: they do not form a DTMC.
Worked Example 1.3.20 Let {U j : j = 1, 2, . . .} be an irreducible and aperiodic
DTMC with a finite state space. Given n 1 and (0, 1), order the
strings u(n)
(n)
(n)
according to their probabilities P(U(n) = u1 ) P(U(n) = u2 ) and select
them in this order until the probability of the remaining set becomes 1 . Let
1
Mn ( ) denote the number of the selected strings. Prove that lim log Mn ( ) = H ,
n n
the information rate of the source,
(a) in the case where the rows of the transition probability matrix P are all equal
(i.e. {U j } is a Bernoulli sequence),
(b) in the case where the rows of P are permutations of each other, and in a general
case. Comment on the significance of this result for coding theory.
Solution (a) Let P stand for the probability distribution of the IID sequence (Un )
m
and set H = p j log p j (the binary entropy of the source). Fix > 0 and partij=1
2n(H+ ) Mn ( ) P Mn ( ) P(K+ ) + 2n(H ) Mn ( ).
Excluding P Mn ( ) yields
2n(H+ ) Mn ( ) + 2n(H ) and 2n(H ) Mn ( ) P(K+ ).
58
1 2 3
4 5 6 .
(1.3.45)
7 8 9
Find the entropy rate for a rook, bishop (both kinds), queen and king.
59
Solution We consider the kings DTMC only; other cases are similar. The transition
probability matrix is
0 1/3 0
0 1/3 1/3 0
0
0
1/5 1/5 0
0 1/5 0 1/5 1/5 0
1/8
1/8
1/8
1/8
0
1/8
1/8
1/8
1/8
0
0
1/3
1/3
0
0
0
1/3
0
0
0 1/5 1/5 1/5 1/5 0 1/5
0
0
0
0 1/3 1/3 0 1/3 0
By symmetry the invariant distribution is 1 = 3 = 9 = 7 = , 4 = 2 = 6 =
8 = , 5 = , and by the DBEs
/3 = /5, /3 = /8, 4 + 4 + = 1
implies =
3
40 ,
= 18 , = 15 . Now
1
1
1
1
3
1
1
1
log 15 + .
H = 4 log 4 log log =
3
3
5
5
8
8 10
40
60
61
In other words, strings of length N sent to the channel will be codewords representing source messages of a shorter length n. The maximal ratio n/N which still
allows the receiver to recover the original message is an important characteristic of
the channel, called the capacity. As we will see, passing from a = 1 to a = 1
changes the capacity from 1 (no encoding needed) to 1/2 (where the codewordlength is twice as long as the length of the source message).
So, we need to introduce a decoding rule fN : J N I n such that the overall
probability of error (= ( fn , fN , P)) defined by
= P fN (Y(N) ) = u(n) , u(n) emitted
u(n)
= P U(n) = u(n) Pch fN (Y(N) ) = u(n) | fn (u(n) ) sent
(1.4.3)
u(n)
is small. We will try (and under certain conditions succeed) to have the errorprobability (1.4.3) tending to zero as n .
The idea which is behind the construction is based on the following facts:
(1) For a source with the asymptotic equipartition property the number of distinct n-strings emitted is 2n(H+o(1)) where H log m is the information rate of
the source. Therefore, we have to encode not mn = 2n log m messages, but only
2n(H+o(1)) which may be considerably less. That is, the code fn may be defined
on a subset of I n only, with the codeword-length N = nH.
1
(2) We may try even a larger N: N = R nH, where R is a constant with 0 < R <
1. In other words, the increasing length of the codewords used from nH to
1
R nH will allow us to introduce a redundancy in the code fn , and we may
hope to be able to use this redundancy to diminish the overall error-probability
(1.4.3) (provided that in addition a decoding rule is good). It is of course
1
desirable to minimise R , i.e. maximise R: it will give the codes with optimal
parameters. The question of how large R is allowed to be depends of course on
the channel.
It is instrumental to introduce a notational convention. As the codeword-length is
1
a crucial parameter, we write N instead of R Hn and RN instead of Hn: the number of distinct strings emitted by the source becomes 2N(R+o(1)) . In future, the index
NR
n
will be omitted wherever possible (and replaced by N otherwise). It is conH
venient to consider a typical set UN of distinct strings emitted by the source, with
UN = 2N(R+o(1)) . Formally, UN can include strings of different length; it is only
the log-asymptotics of UN that matter. Accordingly, we will omit the superscript
(n) in the notation u(n) .
62
log UN = R,
N N
J N , and a sequence
N UN
uUN Y(N)
C = sup R (0, 1) : R is a reliable transmission rate .
(1.4.5)
Definition 1.4.3
(1.4.6)
Remark 1.4.4
X xXN
Accordingly, it makes sense to write = ave and speak about the average probability of error. Another form is the maximum error-probability
max = max 1 Pch fN (Y(N) ) = x|x sent : x XN ;
obviously, ave max . In this section we work with ave 0 leaving the question
of whether max 0. However, in Section 2.2 we reduce the problem of assessing
max to that with ave , and as a result, the formulas for the channel capacity deduced
in this section will remain valid if ave is replaced by max .
Remark 1.4.5
given by
63
(1.4.7)
pXk
Here, I(Xk : Yk ) is the mutual information between a single pair of input and output
letters Xk and Yk (the index k may be omitted), with the joint distribution
P(X = x,Y = y) = pX (x)P(y|x), x, y = 0, 1,
(1.4.8)
where pX (x) = P(X = x). The supremum in (1.4.7) is over all possible distributions
pX = (pX (0), pX (1)). A useful formula is I(X : Y ) = h(Y ) h(Y |X) (see (1.3.12)).
In fact, for the MBSC
h(Y |X) =
x=0,1
pX (x)
y=0,1
(1.4.9)
y=0,1
(1.4.10)
pX
But sup h(Y ) is equal to log 2 = 1: it is attained at pX (0) = pX (1) = 1/2, and
pX
pY (0) = pY (1) = 1/2(p + 1 p) = 1/2. Therefore, for an MBSC, with the row
error-probability p,
C = 1 (p).
(1.4.11)
64
In fact, Shannons ideas have not been easily accepted by leading contemporary
mathematicians. It would be interesting to see the opinions of the leading scientists
who could be considered as creators of information theory.
Theorem 1.4.6 Fix a channel (i.e. conditional probabilities Pch in (1.4.1)) and a
set U of the source strings and denote by (P) the overall error-probability (1.4.3)
for U(n) having a probability distribution P over U , minimised over all encoding
and decoding rules. Then
(P) (P0 ),
(1.4.12)
(u) :=
y: f(y)=u
(= (P, f , f)) =
P(u) (u).
uU
1
U
uU
such
that ( ) . In fact, take a random permutation, , equidistributed among
all U ! permutations of degree U . Then
min ( ) E () = E P(u) (u)
uU
uU
1
(u) = .
U uU
Hence, given any f and f, we can find new encoding and decoding rules with
overall error-probability (P0 , f , f). Minimising over f and f leads to (1.4.12).
65
Let h(X | Y ) denote the conditional entropy of X given Y Define the ideal observer
decoding rule as a map f IO : J I such that P( f (y) | y) = maxxI P(x | y) for all
y J . Show that
(a) under this rule the error-probability
1
satisfies erIO (y) h(P( | y));
2
1
(b) the expected value of the error-probability obeys EerIO (Y ) h(X | Y ).
2
Solution Indeed, (a) follows from (iii) in Worked Example 1.2.7, as
IO
err
= 1 P f (y) | y = 1 Pmax ( |y),
1
h(P( |y)). Finally, (b) follows from (a) by taking
2
expectations, as h(X|Y ) = Eh(P( |Y )).
which is less than or equal to
As was noted before, a general decoding rule (or a decoder) is a map fN : J N
UN ; in the case of a lossless encoding rule fN , fN is a map J N XN . Here X
is a set of codewords. Sometimes it is convenient to identify the decoding rule by
(N)
(N)
fixing, for each codeword x(N) , a set A(x(N) ) J N , so that A(x1 ) and A(x2 ) are
(N)
(N)
disjoint for x1 = x2 , and the union x(N) XN A(x(N) ) gives the whole J N . Given
that y(N) A(x(N) ), we decode it as fN (y(N) ) = x(N) .
Although in the definition of the channel capacity we assume that the source
messages are equidistributed (as was mentioned, it gives the worst case in the
sense of Theorem 1.4.6), in reality of course the source does not always follow
this assumption. To this end, we need to distinguish between two situations: (i) the
receiver knows the probabilities
p(u) = P(U = u)
(1.4.13)
of the source strings (and hence the probability distribution pN (x(N) ) of the codewords x(N) XN ), and (ii) he does not know pN (x(N) ). Two natural decoding rules
are, respectively,
66
(i) the ideal observer (IO) rule decodes a received word y(N) by a codeword x(N)
that maximises the posterior probability
p (x(N) )P (y(N) |x(N) )
N
ch
,
(1.4.14)
P x(N) sent |y(N) received =
pY(N) (y(N) )
where
pY(N) (y(N) ) =
x(N) XN
and
(ii) the maximum likelihood (ML) rule decodes a received word y(N) by a codeword
(N)
x that maximises the prior probability
Pch (y(N) |x(N) ).
(1.4.15)
Theorem 1.4.8 Suppose that an encoding rule f is defined for all messages that
occur with positive probability and is one-to-one. Then:
(a) For any such encoding rule, the IO decoder minimises the overall errorprobability among all decoders.
(b) If the source message U is equiprobable on a set U , then for any encoding rule
f : U XN as above, the random codeword X(N) = f (U) is equiprobable on
XN , and the IO and ML decoders coincide.
Proof Again, for simplicity let us omit the upper index (N).
(a) Note that, given a received word y, the IO obviously maximises the joint
probability p(x)Pch (y|x) (the denominator in (1.4.14) is fixed when word y is
fixed). If we use an encoding rule f and decoding rule f, the overall errorprobability (see (1.4.3)) is
P(U = u)Pch f(y) = u| f (u) sent
u
= p(x) 1 f(y) = x Pch (y|x)
x
y
= 1 x = f(y) p(x)Pch (y|x)
y x
= p(x)Pch (y|x) p f(y) Pch y| f(y)
y x
y
= 1 p f (y) Pch y| f(y) .
y
It remains to note that each term in the sum p f(y) Pch y| f(y) is maximised
y
when f coincides with the IO rule. Hence, the whole sum is maximised, and the
overall error-probability minimised.
(b) The first statement is obvious, as, indeed is the second.
67
Assuming in the definition of the channel capacity that the source messages are
equidistributed, it is natural to explore further the ML decoder. While using the ML
decoder, an error can occur because either the decoder chooses a wrong codeword
x or an encoding rule f used is not one-to-one. The probability of this is assessed
in Theorem 1.4.8. For further simplification, we write P instead of Pch ; symbol P
is used mainly for the joint input/output distribution.
Lemma 1.4.9 If the source messages are equidistributed over a set U then, while
using the ML decoder and an encoding rule f , the overall error-probability satisfies
( f )
Proof
1
U
uU u U : u =u
P P Y| f (u ) P (Y| f (u)) |U = u .
(1.4.16)
(a) an error when P Y | f (u ) > P Y | f (u) for some u = u,
(b) possibly an error when P Y| f (u ) = P Y | f (u) for some u = u (this includes the case when f (u) = f (u )), and finally
(c) no error when P Y | f (u ) < P Y | f (u) for any u = u.
Thus, the probability is bounded as follows:
P error | U = u
P P Y | f (u ) P (Y | f (u)) for some u = u | U = u
1 u = u P P Y | f (u ) P (Y | f (u)) | U = u .
u U
Multiplying by
1
and summing up over u yields the result.
U
Remark 1.4.10
As was already noted, a random coding is a useful tool alongside with deterN
ministic encoding rules. A deterministic encoding rule
( is a map f : U) J ; if
U = r then f is given as a collection of codewords f (u1 ), . . . , f (ur ) or, equivalently, as a concatenated megastring (or codebook)
r
= {0, 1}Nr .
f (u1 ) . . . f (ur ) J N
Here, u1 , . . . , ur are the source strings (not letters!) constituting set U . If f is lossless then f (ui ) =
N r rule is a random eler f (u j ) whenever i = j. A random encoding
N
ment F of J
, with probabilities P(F = f ), f J
. Equivalently, F may
68
(1.4.17)
Theorem 1.4.11
(i) There exists a deterministic encoding rule f with ( f ) E .
E
(ii)
P (F) <
for any (0, 1).
1
Proof
For (ii), use the Chebyshev inequality (see PSE I, p. 75):
Part (i) is obvious.
E
1
E = 1 .
P (F)
1
E
Definition 1.4.12
69
(1.4.18)
Recall that I X(N) : Y(N) is the mutual entropy given by
h X(N) h X(N) |Y(N) = h Y(N) h Y(N) |X(N) .
Remark 1.4.13 A simple heuristic argument (which will be made rigorous in
Section 2.2) shows that the capacity of the channel cannot exceed the mutual
information between its input and output. Indeed, for each typical input Nsequence, there are
(N) |X(N) )
approximately 2h(Y
all of them equally likely. We will not be able to detect which sequence X was sent
unless no two X(N) sequences produce the same Y(N) output sequence. The total
(N)
number of typical Y(N) sequences is 2h(Y ) . This set has to be divided into subsets
(N) (N)
of size 2h(Y |X ) corresponding to the different input X(N) sequences. The total
number of disjoint sets is
(N) )h(Y(N) |X(N) )
2h(Y
(N) :Y(N)
= 2I (X
).
Hence, the total number of distinguishable signals of the length N could not be
(N) (N)
bigger than 2I (X :Y ) . Putting the same argument slightly differently, the number
(N)
(N) (N)
of typical sequences X(N) is 2Nh(X ) . However, there are only 2Nh(X ,Y ) jointly
typical sequences (X(N) , Y(N) ). So, the probability that any randomly chosen pair
(N) (N)
is jointly typical is about 2I (X :Y ) . So, the number of distinguished signals is
(N)
(N)
(N) (N)
bounded by 2h(X )+h(Y )h(X |Y ) .
Theorem 1.4.14
(1.4.19)
( f ) 1
CN + o(1)
.
R + o(1)
(1.4.20)
70
The assertion of the theorem immediately follows from (1.4.20) and the definition
of the channel capacity because
lim inf ( f ) 1
N
1
lim sup CN
R
N
( f ) = P(X(N) = xi , f(Y(N) ) = xi ).
(N)
(N)
i=1
( f )
C + o(1)
N(R + o(1)) NCN
= 1 N
.
R + o(1)
log 2N(R+o(1)) 1
Let p(X(N) , Y(N) ) be the random variable that assigns, to random words X(N) and
the joint probability of having these words at the input and output of a channel, respectively. Similarly, pX (X(N) ) and pY (Y(N) ) denote the random variables
that give the marginal probabilities of words X(N) and Y(N) , respectively.
Y(N) ,
71
Theorem 1.4.15 (Shannons SCT: direct part) Suppose we can find a constant
c (0, 1) such that for any R (0, c) and N 1 there exists a random coding
F(u1 ), . . . , F(ur ), where r = 2N(R+o(1)) , with IID codewords F(ui ) J N , such
that the (random) input/output mutual information
N :=
p(X(N) , Y(N) )
1
log
N
pX (X(N) )pY (Y(N) )
(1.4.21)
(1.4.22)
So, if the LHS and RHS sides of (1.4.22) coincide, then their common value gives
the channel capacity.
Next, we use Shannons SCT for calculating the capacity of an MBC. Recall (cf.
(1.4.2)), for an MBC,
N
(1.4.23)
P y(N) |x(N) = P(yi |xi ).
i=1
72
Theorem 1.4.17
For an MBC,
I X(N) : Y(N)
I(X j : Y j ),
(1.4.24)
j=1
I X(N) : Y(N) = h Y(N) h Y(N) |X(N)
= h Y(N) h(Y j |X j )
1 jN
h(Y j ) h(Y j |X j ) = I(X j : Y j ).
j
The equality holds iff Y1 , . . . ,YN are independent. But Y1 , . . . ,YN are independent if
X1 , . . . , XN are.
Remark 1.4.18 Compare with inequalities (1.4.24) and (1.2.27). Note the opposite inequalities in the bounds.
Theorem 1.4.19
(1.4.25)
73
any r, i.e. for any R (even R > 1!).] For this random coding, the (random) mutual
entropy N equals
p X(N) , Y(N)
1
log (N) (N)
N
pY Y
pX X
N
p(X j ,Y j )
1 N
1
= j,
= log
N j=1
p (X j )pY (Y j ) N j=1
p(X j ,Y j )
.
j )pY (Y j )
The random variables j are IID, and
where j := log
p (X
E j = E log
p(X j ,Y j )
= Ip (X1 : Y1 ).
j )pY (Y j )
p (X
By the law of large numbers for IID random variables (see Theorem 1.3.5), for the
random coding as suggested,
P
Remark 1.4.20
(1.4.26)
Under what condition does he not strictly decrease the capacity? Equality in
(1.4.26) holds iff, under the distribution pX that maximises I(X : Y ), the random variables X and Y are conditionally independent given g(Y ). [For example,
g(y1 ) = g(y2 ) iff for any x, PX|Y (x|y1 ) = PX|Y (x|y2 ); that is, g glues together only
those values of y for which the conditional probability PX|Y ( |y) is the same.] For
an MBC, equality holds iff g is one-to-one, or p = P(1|0) = P(0|1) = 1/2.
74
(1.4.27)
(see (1.4.11)). The channel capacity is realised by a random coding with the IID
symbols Vl j taking values 0 and 1 with probability 1/2.
Worked Example 1.4.23
(a) Consider a memoryless channel with two input symbols A and B, and three
output symbols, A, B, . Suppose each input symbol is left intact with probability 1/2, and transformed into a with probability 1/2. Write down the channel
matrix and calculate the capacity.
(b) Now calculate the new capacity of the channel if the output is further processed
by someone who cannot distinguish A and , so that the matrix becomes
1
0
.
1/2 1/2
Solution (a) The channel has the matrix
1/2 0 1/2
0 1/2 1/2
and is symmetric (the rows are permutations of each other). So, h(Y |X = x) =
1
1
2 log = 1 does not depend on the value of x = A, B. Then h(Y |X) = 1, and
2
2
I(X : Y ) = h(Y ) 1.
(1.4.28)
If P(X = A) = then Y has the output distribution
1 1
1
, (1 ),
2 2
2
and h(Y |X) is maximised at = 1/2. Then the capacity equals
1
h(1/4, 1/4, 1/2) 1 = .
2
(1.4.29)
(b) Here, the channel is not symmetric. If P(X = A) = then the conditional
entropy is decomposed as
h(Y |X) = h(Y |X = A) + (1 )h(Y |X = B)
= 0 + (1 ) 1 = (1 ).
Then
h(Y ) =
75
1+ 1
1
1+
log
log
2
2
2
2
and
1+
1+ 1
1
log
log
1+
2
2
2
2
which is maximised at = 3/5, with the capacity given by
log 5 2 = 0.321928.
I(X : Y ) =
Our next goal is to prove the direct part of Shannons SCT (Theorem 1.4.15). As
was demonstrated earlier, the proof is based on two consecutive Worked Examples
below.
Worked Example 1.4.24 Let F be a random coding, independent of the source
string U, such that the codewords F(u1 ), . . . , F(ur ) are IID, with a probability distribution pF :
pF (v) = P(F(u) = v),
v (= v(N) ) J N .
Here, u j , j = 1, . . . , r, are source strings, and r = 2N(R+o(1)) . Define random codewords V1 , . . . , Vr1 by
if U = u j then Vi := F(ui )
and Vi := F(ui+1 )
(1.4.30)
Then U (the message string), X = F(U) (the random codeword) and V1 , . . . , Vr1
are independent words, and each of X, V1 , . . . , Vr1 has distribution pF .
Solution This is straightforward and follows from the formula for the joint probability,
P(U = u j , X = x, V1 = v1 , . . . , Vr1 = vr1 )
= P(U = u j ) pF (x) pF (v1 ) . . . pF (vr1 ).
(1.4.31)
Worked Example 1.4.25 Check that for the random coding as in Worked Example 1.4.24, for any > 0,
E = E (F) P(N ) + r2N .
(1.4.32)
Here,
the random variable N is defined in (1.4.21), with EN =
1 (N) (N)
I X :Y
.
N
76
Solution For given words x(= x(N) ) and y(= y(N) ) J N , denote
*
+
Sy (x) := x J N : P(y | x ) P(y | x) .
(1.4.33)
That is, Sy (x) includes all words the ML decoder may produce in the situation
where x was sent and y received. Set, for a given non-random encoding rule f
and a source string u, ( f , u, y) = 1 if f (u ) Sy ( f (u)) for some u = u, and
( f , u, y) = 0 otherwise. Clearly, ( f , u, y) equals
1 1 f (u ) Sy ( f (u))
u : u =u
= 1 1 1 f (u ) Sy ( f (u)) .
u :u =u
It is plain that, for all non-random encoding f , ( f ) E ( f , U, Y), and for all
random encoding F, E = E (F) E (F, U, Y). Furthermore, for the random
encoding as in Worked Example 1.4.24, the expected value E (F, U, Y) does not
exceed
r1
= pX (x) P(y|x)
E 1 1 1 Vi SY (X)
x
y
i=1
r1
E 1 1 1 Vi SY (X) |X = x, Y = y ,
i=1
pX (x) P(y|x) 1 E 1 1{Vi Sy (x)} .
x
r1
i=1
Furthermore, due to the IID property (as explained in Worked Example 1.4.24),
r1
r1
E
1
{V S (x)} = (1 Qy (x)) ,
i
i=1
where
Qy (x) := 1 x Sy (x) pX (x ).
x
p(x, y)
1
log
>
N
pX (x)pY (y)
r2
(1 Qy (x)) j Qy (x).
j=0
(1.4.34)
77
when (x, y) T.
(1.4.35)
r1
1 (1 Qy (x))r = (1 Qy (x)) j Qy (x) (r 1)Qy (x),
j=1
this yields
E P (X,Y ) T + (r 1)
pX (x)P(y|x)Qy (x).
(1.4.36)
(x,y)T
P (X,Y ) T = P N .
(1.4.37)
Qy (x) 2N .
(1.4.38)
condition N c. Therefore, the random coding F gives the expected error probability that vanishes as N .
By Theorem 1.4.11(i), for any N 1 there exists a deterministic encoding f = fN
such that, for R = c 2 , lim ( f ) = 0. Hence, R is a reliable transmission rate.
N
78
Theorems 1.4.17 and 1.4.19 may be extended to the case of a memoryless channel with an arbitrary (finite) output alphabet, Jq = {0, . . . , q 1}. That is, at the
input of the channel we now have a word Y(N) = Y1 . . . YN where each Y j takes a
(random) value from Jq . The memoryless property means, as before, that
N
Pch y(N) |x(N) = P(yi | xi ),
(1.4.39)
i=1
(1.4.40)
where (p0 , . . . , pq1 ) is a row of the channel matrix. The equality is realised in the
case of a double-symmetric channel, and the maximising random coding has IID
symbols Vi taking values from Jq with probability 1/q.
Proof The proof is carried out as in the binary case, by using the fact that I(X1 :
Y1 ) = h(Y1 ) h(Y1 |X1 ) log q h(Y1 |X1 ). But in the symmetric case
h(Y1 | X1 ) = P(X1 = x)P(y | x) log P(y | x)
x,y
(1.4.41)
If, in addition, the columns of the channel matrix are permutations of each other,
then h(Y1 ) attains log q. Indeed, take a random coding as suggested. Then P(Y = y)
q1
1
= P(X1 = x)P(y|x) = P(y|x). The sum P(y|x) is along a column of the
q x
x
x=0
channel matrix, and it does not depend on y. Hence, P(Y = y) does not depend on
y Iq , which means equidistribution.
Remark 1.4.27 (a) In the random coding F used in Worked Examples 1.4.24 and
1.4.25 and Theorems 1.4.6, 1.4.15 and 1.4.17, the expected error-probability E 0
with N . This guarantees not only the existence of a good non-random coding
for which the error-probability E vanishes as N (see Theorem 1.4.11(i)), but
also that almost all codes are asymptotically good. In fact, by Theorem 1.4.11(ii),
79
C ( p)
p
1
Figure 1.8
1
1
;
the rows are permutations of each other, and hence have equal entropies. Therefore,
the conditional entropy h(Y |X) equals
h(1 , , ) = (1 ) log(1 ) log log ,
which does not depend on the distribution of the input symbol X.
Thus, I(X : Y ) is maximised when h(Y ) is. If pY (0) = p and pY (1) = q, then
h(Y ) = log p log p q log q,
80
(1 ) log
so the further processing of the second channel can only reduce the mutual
information.
The independence of the channels means that given Y , the random variables
X and Z are conditionally independent. Deduce that
h(X, Z|Y ) = h(X|Y ) + h(Z|Y )
and
h(X,Y, Z) + h(Z) = h(X, Z) + h(Y, Z).
Define I(X : Z|Y ) as h(X|Y ) + h(Z|Y ) h(X, Z|Y ) and show that
I(X : Z|Y ) = I(X : Y ) I(X : Z).
. . .
. . .
81
. . .
Figure 1.9
The equality holds iff X and Y are conditionally independent given Z, e.g. if the
second channel is error-free (Y, Z) Z is one-to-one, or the first channel is fully
noisy, i.e. X and Y are independent.
(b) The rows of the channel matrix are permutations of each other. Hence h(Y |X) =
h(p0 , . . . , pr1 ) does not depend on pX . The quantity h(Y ) is maximised when
pX (i) = 1/r, which gives
C = log r h(p0 , . . . , pr1 ).
82
then C is given by 2C = 2C1 + 2C2 . To what mode of operation does this correspond?
Solution (a)
X
channel 1
channel 2
Hence,
C = sup I(X : Z) sup I(X : Y ) = C1
pX
pX
and similarly
C sup I(Y : Z) = C2 ,
pY
i.e. C min[C1 ,C2 ]. A strict inequality may occur: take (0, 1/2) and the
matrices
1
1
, ch 2
,
ch 1
1
1
and
1
ch [1 + 2]
2
(1 )2 + 2
2 (1 )
2 (1 )
(1 )2 + 2
.
83
C = 1 h 2 (1 ), 1 2 (1 ) < Ci
channel 1
Y1
X2
channel 2
Y2
But
I (X1 , X2 ) : (Y1 ,Y2 ) = h(Y1 ,Y2 ) h Y1 ,Y2 |X1 , X2
h(Y1 ) + h(Y2 ) h(Y1 |X1 ) h(Y2 |X2 )
= I(X1 : Y1 ) + I(X2 : Y2 );
equality applies iff X1 and X2 are independent. Thus, C = C1 + C2 and the maximising p(X1 ,X2 ) is pX1 pX2 where pX1 and pX2 are maximisers for I(X1 : Y1 ) and
I(X2 : Y2 ).
(c)
channel 1
Y1
channel 2
Y2
X
Here,
C = sup I X : (Y1 : Y2 )
pX
and
I (Y1 : Y2 ) : X = h(X) h X|Y1 ,Y2
h(X) min h(X|Y j ) = min I(X : Y j ).
j=1,2
j=1,2
84
Thus, C max[C1 ,C2 ]. A strict inequality may occur: take an example from part
(a). Here, Ci = 1 h( , 1 ). Also,
I (Y1 ,Y2 ) : X = h(Y1 ,Y2 ) h Y1 ,Y2 |X
= h(Y1 ,Y2 ) h(Y1 |X) h(Y2 |X)
= h(Y1 ,Y2 ) 2h( , 1 ).
If we set pX (0) = pX (1) = 1/2 then
(Y1 ,Y2 ) = (0, 0) with probability (1 )2 + 2 2,
(Y1 ,Y2 ) = (1, 1) with probability (1 )2 + 2 2,
(Y1 ,Y2 ) = (1, 0) with probability (1 ),
(Y1 ,Y2 ) = (0, 1) with probability (1 ),
with
h(Y1 ,Y2 ) = 1 + h 2 (1 ), 1 2 (1 ) ,
and
I (Y1 ,Y2 ) : X = 1 + h 2 (1 ), 1 2 (1 ) 2h( , 1 )
> 1 h( , 1 ) = Ci .
Hence, C > Ci , i = 1, 2.
(d)
X1
channel 1
Y1
channel 2
Y2
X : X1 or X2
X2
The difference with part (c) is that every second only one symbol is sent, either
to channel 1 or 2. If we fix probabilities and 1 that a given symbol is sent
through a particular channel then
I(X : Y ) = h( , 1 ) + I(X1 : Y1 ) + (1 )I(X2 : Y2 ).
(1.4.42)
85
and
h(Y |X) = pX1 ,Y1 (x, y) log pY1 |X1 (y|x)
x,y
0 1
h( , 1 ) + C1 + (1 )C2 ;
pq
, 1 k N.
N
(1p)/p 1
1
1
1
1+
, 1.
q = min
p
Np 1 p
86
1
The condition q = 1 is equivalent to log N log(1 p), i.e.
p
1
N
.
(1 p)1/p
p(x)dx
A
Rn dxp(x) = 1.
The
(1.5.1)
under the assumption that the integral is absolutely convergent. As in the discrete
case, hdiff (X) may be considered as a functional of the density p : x Rn R+ =
[0, ). The difference is however that hdiff (X) may be negative, e.g. for a uniform
1
distribution on [0, a], hdiff (X) = 0a dx(1/a) log(1/a) = log a < 0 for a < 1. [We
write x instead of x when x R.] The relative, joint and conditional differential
entropy are defined similarly to the discrete case:
hdiff (X||Y ) = Ddiff (p||p ) =
hdiff (X,Y ) =
p(x) log
p (x)
dx,
p(x)
(1.5.2)
(1.5.3)
(1.5.4)
again under the assumption that the integrals are absolutely convergent. Here, pX,Y
is the joint probability density and pX|Y the conditional density (the PDF of the
conditional distribution). Henceforth we will omit the subscript diff when it is clear
what entropy is being addressed. The assertions of Theorems 1.2.3(b),(c), 1.2.12,
and 1.2.18 are carried through for the differential entropies: the proofs are completely similar and will not be repeated.
Remark 1.5.2
n (= n (x)) equals 0 or 1. For most of the numbers x the series is not reduced to a
finite sum (that is, there are infinitely many n such that n = 1; the formal statement
87
is that the (Lebesgue) measure of the set of numbers x (0, 1) with infinitely many
n (x) = 1 equals one). Thus, if we want to encode x by means of binary digits we
would need, typically, a codeword of an infinite length. In other words, a typical
value for a uniform random variable X with 0 X 1 requires infinitely many bits
for its exact description. It is easy to make a similar conclusion in a general case
when X has a PDF fX (x).
However, if we wish to represent the outcome of the random variable X with
an accuracy of first n binary digits then we need, on average, n + h(X) bits where
h(X) is the differential entropy of X. Differential entropies can be both positive
and negative, and can even be . Since h(X) can be of either sign, n + h(X) can
be greater or less than n. In the discrete case the entropy is both shift and scale
invariant since it depends only on probabilities p1 , . . . , pm , not on the values of the
random variable. However, the differential entropy is shift but not scale invariant
as is evident from the identity (cf. Theorem 1.5.7)
h(aX + b) = h(X) + log |a|.
However, the relative entropy, i.e. KullbackLeibler distance D(p||q), is scale
invariant.
Worked Example 1.5.3
Consider a PDF on 0 x e1 ,
fr (x) = Cr
1
, 0 < r < 1.
x( ln x)r+1
1
dx =
x( ln x)r+1
0
1
1
yr+1
1
dy = .
r
Hence,
h(X) =
x( ln x)
dx =
r+1
0
0
zerz dz =
1
.
r2
fr (x) ln fr (x)dx
= fr (x) ln r + ln x + (r + 1) ln( ln x) dx
0 e1
r
ln( ln x)
r(r + 1)
= ln r
dx,
x( ln x)r
x( ln x)r+1
0
0
so that for 0 < r < 1, the second term is infinite, and two others are finite.
88
1
log (2 e)d detC .
2
(1.5.5)
exp
,C
(x
)
, x Rd .
1/2
2
(2 )d detC
Then h(X) takes the form
log e
1
d
1
p(x) log (2 ) detC
x ,C (x ) dx
2
Rd
2
log e
1
=
E (xi i )(x j j ) C1 i j + log (2 )d detC
2
2
i, j
1
log e 1
=
C i j E(xi i )(x j j ) + log (2 )d detC
2 i, j
2
log e 1
1
=
C i j C ji + log (2 )d detC
2 i, j
2
1
d log e 1
+ log (2 )d detC = log (2 e)d detC .
=
2
2
2
0
1
log (2 e)d detC ,
2
(1.5.6)
89
p(x) log
p(x)
dx
p0 (x)
(a) Show that the exponential density maximises the differential entropy among
the PDFs on [0, ) with given mean, and the normal density maximises the
differential entropy among the PDFs on R with a given variance.
1
1
log 2 e(Var X + ) .
2
12
Solution (a) For the Gaussian case, see Theorem 1.5.5. In the exponential
case, by the Gibbs inequality, for any random variable Y with PDF f (y),
1
f (y) log f (y)e y / dy 0 or
h(Y ) ( EY log e log ) = h(Exp( )),
with equality iff Y Exp( ), = (EY )1 .
(b) Let X0 be a discrete random variable with P(X0 = i) = pi , i = 1, 2, . . ., and the
random variable U be independent of X0 and uniform on [0, 1]. Set X = X0 + U.
For a normal random variable Y with Var X = VarY ,
hdiff (X) hdiff (Y ) =
1
1
1
log 2 eVar Y = log 2 e(Var X + ) .
2
2
12
90
The value of EX is not essential for h(X) as the following theorem shows.
Theorem 1.5.7
(a) The differential entropy is not changed under the shift: for all y Rd ,
h(X + y) = h(X).
(b) The differential entropy changes additively under multiplication:
h(aX) = h(X) + log |a|,
for all a R.
(1.5.7)
p1 (x)(x, y)
p2 (x)(x, y)
x
p1 (x)
.
= p1 (x)(x, y) log
p2 (x)
x
p1 (x)(x, y) log
p1 (w)(w, y)
w
p2 (z)(z, y)
z
p1 (x)
= D(p1 ||p2 ).
p1 (x)(x, y) log
p
2 (x)
x y
In the continuous case a similar inequality holds if we replace summation by
integration.
91
The concept of differential entropy has proved to be useful in a great variety of situations, very often quite unexpectedly. We consider here inequalities for
determinants and ratios of determinants of positive definite matrices (cf. [39], [36]).
Recall that the covariance matrix C = (Ci j ) of a random vector X = (X1 , . . . , Xd )
is positive definite, i.e. for any complex vector y = (y1 , . . . , yd ), the scalar product
(y,Cy) = Ci j yi y j is written as
i, j
2
E(X
)(X
)y
y
=
E
(X
)y
i i j j i j i i i 0.
i, j
i
Conversely, for any positive definite matrix C there exists a PDF for which C is a
covariance matrix, e.g. a multivariate normal distribution (if C is not strictly positive definite, the distribution is degenerate).
Worked Example 1.5.9
Solution Take two positive definite matrices C(0) and C(1) and [0, 1]. Let X(0)
and X(1) be two multivariate normal vectors, X(i) N(0,C(i) ). Set, as in the proof
of Theorem 1.2.18, X = X() , where the random variable takes two values, 0 and
1, with probabilities and 1 , respectively, and is independent of X(0) and X(1) .
Then the random variable X has covariance C = C(0) + (1 )C(1) , although X
need not be normal. Thus,
1
1
log 2 e)d + log det C(0) + (1 )C(1)
2
2
1
= log (2 e)d detC h(X) (by Theorem 1.5.5)
2
h(X|) (by Theorem 1.2.11)
1
This property is often called the Ky Fan inequality and was proved initially in
1950 by using much more involved methods. Another famous inequality is due to
Hadamard:
Worked Example 1.5.10
(1.5.8)
92
(1.5.9)
where X and Y are independent normal random variables with h(X) = h(X ) and
h(Y ) = h(Y ).
In the d-dimensional case the entropypower inequality is as follows.
For two independent random variables X and Y with PDFs fX (x) and fY (x), x Rd ,
e2h(X+Y )/d e2h(X)/d + e2h(Y )/d .
(1.5.10)
It is easy to see that for d = 1 (1.5.9) and (1.5.10) are equivalent. In general,
inequality (1.5.9) implies (1.5.10) via (1.5.13) below which can be established
independently. Note that inequality (1.5.10) may be true or false for discrete random variables. Consider the following example: let X Y be independent with
PX (0) = 1/6, PX (1) = 2/3, PX (2) = 1/6. Then
16
18
2
h(X) = h(Y ) = ln 6 ln 4, h(X +Y ) = ln 36 ln 8 ln 18.
3
36
36
By inspection, e2h(X+Y ) = e2h(X) + e2h(Y ) . If X and Y are non-random constants
then h(X) = h(Y ) = h(X +Y ) = 0, and the EPI is obviously violated. We conclude
93
(BrunnMinkowski)
(1.5.11)
(b) The volume of the set sum of two sets A1 and A2 is greater than the volume
of the set sum of two balls B1 and B2 with the same volume as A1 and A2 ,
respectively:
V (A1 + A2 ) V (B1 + B2 ),
(1.5.12)
where B1 and B2 are spheres with V (A1 ) = V (B1 ) and V (A2 ) = V (B2 ).
Worked Example 1.5.13
(1.5.13)
(1.5.14)
94
|Cn |
= lim |Cn |1/n .
|Cn1 | n
(1.5.15)
h(X1 , X2 , . . . Xn )
1 n
= lim h(Xk |Xk1 , . . . , X1 )
n
n n
n
i=1
= lim h(Xn |Xn1 , . . . , X1 ).
lim
But, more interestingly, the following intermediate inequality holds true. Let
X1 , X2 , . . . , Xn+1 be IID square-integrable random variables. Then
Xi /d
1 n+1 2h i
2h X1 ++Xn /d
e =j
.
(1.5.16)
e
n j=1
As was established, the differential entropy is maximised by a Gaussian distribution, under the constraint that the variance of the random variable under consideration is bounded from above. We will state without proof the following important
result showing that the entropy increases on every summation step in the central
limit theorem.
95
h
h
.
(1.5.17)
n
n+1
Solution Introduce = n0 n , the set of all strings with digits from . We send
a message x1 x2 . . . xn 1 as the concatenation f (x1 ) f (x2 ) . . . f (xn ) 2 , i.e. f
extends to a function f : 1 2 . We say a code is decipherable if f is injective.
Krafts inequality states that a prefix-free code f : 1 2 with codewordlengths s1 , . . . , sm exists iff
m
qs
1.
(1.6.1)
i=1
i=1
i=1
(1.6.2)
96
In the example, the three extra codewords must be 00, 01, 10 (we cannot take
11, as then a sequence of ten 1s is not decodable). Reversing the order in every
codeword gives a prefix-free code. But prefix-free codes are decipherable. Hence,
the code is decipherable.
In conclusion, we present an alternative proof of necessity of Krafts inequality. Denote s = max si ; let us agree to extend any word in X to the length s,
say by adding some fixed symbol. If x = x1 x2 . . . xsi X , then any word of the
form x1 x2 . . . xsi ysi +1 . . . ys X because x is a prefix. But there are at most qssi of
such words. Summing up on i, we obtain that the total number of excluded words
ssi . But it cannot exceed the total number of words qs . Hence, (1.6.1)
is m
i=1 q
follows:
m
qs qsi qs .
i=1
Problem 1.2
Consider an alphabet with m letters each of which appears with
probability 1/m. A binary Huffman code is used to encode the letters, in order to
minimise the expected codeword-length (s1 + + sm )/m where si is the length of
a codeword assigned to letter i. Set s = max[si : 1 i m], and let n be the number
of codewords of length .
(a) Show that 2 ns m.
(b) For what values of m is ns = m?
(c) Determine s in terms of m.
(d) Prove that ns1 + ns = m, i.e. any two codeword-lengths differ by at most 1.
(e) Determine ns1 and ns .
(f) Describe the codeword-lengths for an idealised model of English (with m =
27) where all the symbols are equiprobable.
(g) Let now a binary Huffman code be used for encoding symbols 1, . . . , m occurring with probabilities p1 pm > 0 where p j = 1. Let s1 be the length
1 jm
c
b
i
2
97
Figure 1.10
Bound ns m is obvious. (From what is said below it will follow that ns is always
even.)
(b) ns = m means all codewords are of equal length. This, obviously, happens iff
m = 2k , in which case s = k (a perfect binary tree Tk with 2k leaves).
(c) In general,
'
s=
log m,
if m = 2k ,
log m, if m = 2k .
The case m = 2k was discussed in (b), so let us assume that m = 2k . Then 2k < m <
2k+1 where k = log m. This is clear from the observation that the binary tree for
probabilities 1/m (we will call it a binary m-tree Bm ) contains the perfect binary
tree Tk but is contained in Tk+1 . Hence, s is as above.
(d) Indeed, in the case of an equidistribution 1/m, . . ., 1/m it is impossible to have
a branch of the tree whose length differs from the maximal value s by two or more.
In fact, suppose there is such a branch, Bi , of the binary tree leading to some letter i
and choose a branch M j of maximal length s leading to a letter j. In a conventional
terminology, letter j was engaged in s merges and i in t s 2 merges. Ultimately,
the branches Bi and M j must merge, and this creates a contradiction. For example,
the least controversial picture is still illegal; see Figure 1.10. Here, vertex i
carrying probability 1/m should have been joined with vertex a or b carrying each
probability 2/m, instead of joining a and b (as in the figure), as it creates vertex c
carrying probability 4/m.
(e) We conclude that (i) for m = 2k , the m-tree Bm coincides with Tk , (ii) for m = 2k
we obtain Bm in the following way. First, take a binary tree Tk where k = [log m],
with 1 m 2k < 2k . Then m 2k leaves of Tk are allowed to branch one step
98
k+1 _
_ k
2 (m 2 )
Figure 1.11
further: this generates 2(m 2k ) = 2m 2k+1 leaves of tree Tk+1 . The remaining
2k (m 2k ) = 2k+1 m leaves of Tk are left intact. See Figure 1.11. So,
ns1 = 2k+1 m,
(f) In the example of English, with equidistribution among m = 27 = 16 + 11 symbols, we have 5 codewords of length 4 and 22 codewords of length 5. The average
codeword-length is
5 4 + 22 5 130
=
4.8.
27
27
3
4
(g) The minimal value for s1 is 1 (obviously). The maximal value is log2 m ,
i.e. the positive integer l with 2l < m 2l+1 . The maximal value for sm is m
1 (obviously). The minimal value is log2 m, i.e. the natural l such that 2l1 <
m 2l .
The tree that yields s1 = 1 and sm = m 1 is given in Figure 1.12.
It is characterised by
i
1
2
..
.
f (i)
0
10
..
.
si
1
2
..
.
m1
m
11. . . 10
11. . . 11
m1
m1
99
m1 m
Figure 1.12
m = 16
Figure 1.13
State Shannons second coding theorem (SCT) and use it to compute the capacity
of this channel.
100
m = 18
Figure 1.14
where the random variables X and Y represent an input and the corresponding
output.
The binary erasure channel keeps an input letter 0 or 1 intact with probability
1 p and turns it to a splodge with probability p. An input random variable X is
0 with probability and 1 with probability 1 . Then the output random variable
Y takes three values:
P(Y = 0) = (1 p) ,
P(Y = 1) = (1 p)(1 ),
P(Y = ) = p.
Thus, conditional on the value of Y , we have
h(X|Y = 0) = 0,
h(X|Y = ) = h( ),
Therefore,
capacity = max I(X : Y )
= max [h(X) h(X|Y )]
= max [h( ) ph( )]
= (1 p) max h( ) = 1 p,
101
pi
0. In fact, take i = (x, y) and
qi
x,y
102
(b) Define a random variable T equal to 0 with probability and 1 with probability
1 . Then the random variable Z has the distribution W ( ) where
'
Z=
X,
if T = 0,
Y,
if T = 1.
By part (a),
h(Z|T ) h(Z),
with the LHS = h(X) + (1 )h(Y ), and the RHS = h(W ( )).
(c) Observe that for independent random variables X and Y , h(X + Y |X) =
h(Y |X) = h(Y ). Hence, again by part (a),
h(X +Y ) h(X +Y |X) = h(Y ).
Using this fact, for all 1 < 2 , take X Po(1 ), Y Po(2 1 ), independently.
Then
h(X +Y ) h(X) implies hPo (2 ) hPo (1 ).
Problem 1.5
What does it mean to transmit reliably at rate R through a binary
symmetric channel (MBSC) with error-probability p? Assuming Shannons second coding theorem (SCT), compute the supremum of all possible reliable transmission rates of an MBSC. What happens if: (i) p is very small; (ii) p = 1/2; or
(iii) p > 1/2?
Solution An MBSC can
reliably at rate R if there is a sequence of codes
8 transmit
9
NR
codewords such that
XN , N = 1, 2, . . ., with 2
e(XN ) = max P error|x sent 0 as N .
xXN
By the SCT, the so-called operational channel capacity is sup R = max I(X : Y ),
the maximum information transmitted per input symbol. Here X is a Bernoulli
random variable taking values 0 and 1 with probabilities [0, 1] and 1 , and
Y is the output random variable for the given input X. Next, I(X : Y ) is the mutual
entropy (information):
I(X : Y ) = h(X) h(X|Y ) = h(Y ) h(Y |X).
103
Observe that the binary entropy function h(x) 1 with equality for x = 1/2.
Selecting = 1/2 conclude that the MBSC with error probability p has the
capacity
= max h( p + (1 )(1 p)) (p)
(1.6.3)
ai log bi ai log bi
i
i
i
(1.6.5)
and show that for any probability distributions p = (p(x)) and q = (q(x)),
2
log2 e
D(pq)
(1.6.6)
|p(x) q(x)| .
2
x
104
Solution (i) Denote x = ln , and taking the logarithm twice obtain the inequality
x 1 > ln x. This is true as x > 1, hence e > e .
(ii) Assume without loss of generality that ai > 0 and bi > 0. The function g(x) =
x log x is strictly convex. Hence, by the Jensen inequality for any coefficients ci =
1, ci 0,
ci g(xi ) g ci xi .
1
and xi = ai /bi , we obtain
Selecting ci = bi b j
j
ai
ai
ai
b j log bi b j
i
log
ai
i
bj
q(x)
c p(x) 1
= c 1 q(B) 0.
p(x)
xB
D(pq) =
g(A) = g(x),
xA
Then
f (x)
f (A)p(x)
xA
xA
= f (A)
p(x)
xA
;<
f (A) log
f (A)
.
g(A)
+ f (A) log
f (A)
g(A)
105
then
p(x)
p(x)
+ p(x) log
q(x) xAc
q(x)
xA
c)
p(A
p(A)
+ p(Ac ) log
p(A) log
q(A)
q(Ac )
2
2 log2 e
2 log2 e p(A) q(A) =
|p(x) q(x)| .
2
x
Problem 1.7 (a) Define the conditional entropy, and show that for random variables U and V the joint entropy satisfies
h(U,V ) = h(V |U) + h(U).
(1.6.7)
i=1
where h(XS ) = h(Xs1 , . . . , Xsk ) for S = {s1 , . . . , sk }. Assume that, for any i,
h(Xi |XS ) h(Xi |XT ) when T S, and i
/ S.
By considering terms of the form
h(X1 , . . . , Xn ) h(X1 , . . . Xi1 , Xi+1 , . . . , Xn )
(n)
(n)
(k)
(n)
(n)
(n)
(n)
t1 t2 tn .
106
(1.6.8)
and, in general,
h(X1 , . . . , Xn )
= h(X1 , . . . , Xi1 , Xi+1 , . . . , Xn ) + h(Xi |X1 , . . . , Xi1 , Xi+1 , . . . , Xn )
h(X1 , . . . , Xi1 , Xi+1 , . . . , Xn ) + h(Xi |X1 , . . . , Xi1 ),
(1.6.9)
because
h(Xi |X1 , . . . , Xi1 , Xi+1 , . . . , Xn ) h(Xi |X1 , . . . , Xi1 ).
Then adding equations (1.6.9) from i = 1 to n:
n
The second sum in the RHS equals h(X1 , . . . , Xn ) by the chain rule (1.6.7). So,
n
(n)
(n)
(1.6.10)
In general, fix a subset S of size k in {1, . . . , n}. Writing S(i) for S \{i}, we obtain
1
h(X[S(i)])
1
h[X(S)]
,
k
k iS k 1
107
.
h =
k k
k
k(k 1)
S{1,...,n}: |S|=k
S{1,...,n}: |S|=k iS
(1.6.11)
Finally, each subset of size k 1, S(i), appears [n (k 1)] times in the sum
(n)
(1.6.11). So, we can write hk as
n
h[X(T )] n (k 1)
k
k
T {1,...,n}: |T |=k1 k 1
n
h[X(T )]
=
= hnk1 .
1
k
1
T {1,...,n}: |T |=k1
(c) Starting from (1.6.11), exponentiate and then apply the arithmetic
mean/geometric mean inequality, to obtain for S0 = {1, 2, . . . , n}
e h(X(S0 ))/n e [h(S0 (1))++h(S0 (n))]/(n(n1))
(n)
1 n h(S0 (i))/(n1)
e
n i=1
(n)
which is equivalent to tn tn1 . Now we use the same argument as in (b), taking
(n)
(n)
the average over all subsets to prove that for all k n,tk tk1 .
Problem 1.8
Prove that
The random variables X and Y with values x and y from finite alphabets I and
J represent the input and output of a transmission channel, with the conditional
probability P(x | y) = P(X = x | Y = y). Let h(P( | y)) denote the entropy of the
conditional distribution P( | y), y J , and h(X | Y ) denote the conditional entropy
of X given Y . Define the ideal observer decoding rule as a map f : J I such
that P( f (y) | y) = maxxI P(x | y) for all y J . Show that under this rule the errorprobability
er (y) =
P(x | y)
xI: x= f (y)
1
satisfies er (y) h(P( | y)), and the expected error satisfies
2
1
Eer (Y ) h(X | Y ).
2
108
Solution Bound (i) follows from the pooling inequality. Bound (ii) holds as
pi log pi pi log
i
1
1
= log .
p
p
To check (iii), it is convenient to use (i) for p 1/2 and (ii) for p 1/2. Assume
first that p 1/2. Then, by (i),
h(p1 , . . . , pn ) h (p , 1 p ) .
The function x (0, 1) h(x, 1 x) is concave, and its graph on (1/2, 1) lies
strictly above the line x 2(1 x). Hence,
h(p1 , . . . , pn ) 2 (1 p ) .
On the other hand, if p 1/2, we use (ii):
h(p1 , . . . , pn ) log
1
.
p
1
1
2(1 x); equality iff x = .
x
2
109
where Psource stands for the source probability distribution, and one uses a code fN
and a decoding rule fN . A value R (0, 1) is said to be a reliable transmission rate
if, given that Psource is an equidistribution over a set UN of source strings u with
UN = 2N[R+o(1)] , there exist fN and fN such that
1
(N)
f
P
(Y
)
=
u
|
f
(u)
sent
= 0.
lim
N
N
ch
N UN
uUN
The channel capacity is the supremum of all reliable transmission rates.
For an erasure channel, the matrix is
1 p
0
p
0
1 p p
1 0
0
1
The conditional entropy h(Y |X) = h(p, 1 p) does not depend on pX . Thus,
C = sup I(X : Y ) = sup h(Y ) h(Y |X)
pX
pX
110
n1
1
n1
0 n2
2
n
n2
1
n1
1
1
n2
n2
n2
0
1
1
1
n2
n1
n2
1
...
...
...
n1
n2
1
n2
1
n1
n2
n2
n1
n1
The channel is double-symmetric (the rows and columns are permutations of each
other), hence the capacity-achieving input distribution is
1
pX (0) = pX (1) = ,
2
111
we find
1
1
1
dC(x) 3
=
+
log(x 1)
dx
x x 1 x(x 1) x2
1
2 1
= 2 log(x 1) = 2 [ 2x log(x 1)] > 0, x > 1.
x x
x
Thus, Cn increases with n for n 1. When i and i are treated as 0 or 1, the
capacity does not change.
Problem 1.12
Let Xi , i = 1, 2, . . . , be IID random variables, taking values 1, 0
with probabilities p and (1 p). Prove the local De MoivreLaplace theorem with
a remainder term:
1
exp[nh p (y) + n (k)], k = 1, . . . , n 1; (1.6.12)
P(Sn = k) = $
2 y(1 y)n
here
Sn =
Xi ,
y = k/n,
1in
1y
y
+ (1 y) ln
h p (y) = y ln
p
1 p
1
,
6ny(1 y)
y = k/n.
n! = 2 n, nn en+ (n) ,
where
1
1
< (n) <
.
12n + 1
12n
112
Solution Write
n!
n
(1 p)nk pk
(1 p)nk pk =
P(Sn = k) =
k
k!(n
k)!
?
nn
n
=
(1 p)nk pk
nk
2 k(n k) kk (n
k)
+k ln p + (n k) ln(1 p) exp (n) (k) (n k)
1
=$
exp nh p (y)
2 ny(1 y)
Now, as
| (n) (k) (n k)| <
1
1
2n2
1
+
+
<
,
12n 12k 12(n k) 12nk(n k)
(1.6.12) follows, with n (k) = (n) (k) (n k). By the Gibbs inequality,
h p (y) 0 and h p (y) = 0 iff y = p. Furthermore,
d2 h p (y) 1
dh p (y)
y
1y
1
= ln ln
, and
> 0,
= +
2
dy
p
1 p
dy
y 1y
which yields
dh p (y)
dh p (y)
dh p (y)
< 0, 0 < y < p, and
> 0, p < y < 1.
= 0,
dy y=p
dy
dy
Hence,
at y = p,
h p = min h p (y) = 0, attained
1
1
h p = max h p (y) = min ln , ln
, attained at y = 0 or y = 1.
p
1 p
Thus, the maximal probability for n 1 is for y = p, i.e. k+ = np:
1
exp n (np) ,
P Sn = np $
2 np(1 p)
where
|n (np)|
1
.
6np(1 p)
113
Problem 1.13
(a) Prove that the entropy h(X) = p(i) log p(i) of a discrete
i=1
(1.6.13)
Show also that inequality (1.6.13) remains true even when p < 1/2.
Solution (a) Concavity of h(p) means that
h(1 p1 + 2 p2 ) 1 h(p1 ) + 2 h(p2 )
(1.6.14)
(1.6.15)
If PY |X (.|.) are fixed, the second term is a linear function of pX , hence concave. The
first term, h(Y ), is a concave function of pY which in turn is a linear function of
pX . Thus, h(Y ) is concave in pX , and so is I(X : Y ).
(b) Consider two cases, (i) p 1/2 and (ii) p 1/2. In case (i), by pooling
inequality,
h(X) h(p , 1 p ) (1 p ) log
1
1
(1 p ) log = 2(1 p )
p (1 p )
4
114
115
i: i <i
pi
pi
i: i >i
= pi sign (i i )
i
= E sgn( ) E 2 1 ,
as sign x 2x 1 for integer x. Continue the argument with
E 2 1 = pi 2i i 1
i
= 2i 2i i 1 =
i
2 2
i
1 2i = 1 1 = 0
i
2i i 1, i.e. i = i .
Problem 1.15
Define the capacity C of a binary channel. Let CN =
(N)
(N)
(1/N) sup I(X : Y ), where I(X(N) : Y(N) ) denotes the mutual entropy between
X(N) , the random word of length N sent through the channel, and Y(N) , the received
word, and where the supremum is over the probability distribution of X(N) . Prove
that C lim supN CN .
Solution A binary channel is defined as a sequence of conditional probability distributions
(N)
116
e(N) = e(N) f (N) , f(N) .
The converse part of Shannons second coding theorem (SCT) states that
1
(1.6.20)
C lim sup sup I X(N) : Y(N) ,
N N P (N)
X
where I X(N) : Y(N) is the mutual entropy between the random input and output
strings X(N) and Y(N) and PX(N) is a distribution of X(N) .
For the proof, it suffices to check that if U (N) = 2N[R+O(1)] then, for all f (N)
and f(N) ,
CN + o(1)
e(N) 1
(1.6.21)
R + o(1)
where
CN =
1
sup I X(N) : Y(N) .
N P (N)
X
according
to
(1.6.21)
117
The last bound here follows from the generalised Fano inequality
h X(N) | f Y(N) e(N) log e(N) 1 e(N) log 1 e(N)
+e(N) log U (N) 1
1 + e(N) log U (N) 1 .
Now, from (1.6.22),
i.e.
(N)
N R + o(1) NCN 1
CN + o(1)
= 1
,
R + o(1)
log 2N[R+o(1)] 1
as required.
Problem 1.16
A memoryless channel has input 0 and 1, and output 0, 1 and
(illegible). The channel matrix is given by
P(0|0) = 1, P(0|1) = P(1|1) = P(|1) = 1/3.
Calculate the capacity of the channel and the input probabilities pX (0) and pX (1)
for which the capacity is achieved.
Someone suggests that, as the symbol may occur only from 1, it is to your
advantage to treat as 1: you gain more information from the output sequence, and
it improves the channel capacity. Do you agree? Justify your answer.
Solution Use the formula
pX
1 + 2p 2(1 p)
1 p
1 + 2p
log
log
.
3
3
3
3
118
Also,
h(Y |X) = pX (x) P(y|x) log P(y|x)
x=0,1
1 + 2p 2(1 p)
1 p
1 + 2p
log
log
(1 p) log 3.
3
3
3
3
Differentiating yields
d
I(X : Y ) = 2/3(log (1/3 + 2p/3) + 2/3 log (1/3 p/3) + log 3.
dp
Hence, the maximum max I(X : Y ) is found from relation
1 p
2
log
+ log 3 = 0.
3
1 + 2p
This yields
log
and
3
1 p
= log 3 := b,
1 + 2p
2
1 p
= 2b , i.e. 1 2b = p 1 + 2b+1 .
1 + 2p
The answer is
p=
1 2b
.
1 + 2b+1
where q(x, y) = P(X = x|Y = y). For which and for which X , Y does equality
hold here?
119
where
q(x, y) = P(X = x|Y = y).
The joint entropy is given by
h(X,Y ) = P(X = x,Y = y) log P(X = x,Y = y).
x,y
1
1
E log q(X,Y ) =
h(X|Y ).
log 1/
log 1/
Here equality holds iff
P q(X,Y ) = ) = 1.
This requires that (i) = 1/m where m is a positive integer and (ii) for all y
support Y , there exists a set Ay of cardinality m such that
P(X = x|Y = y) =
1
,
m
for x Ay .
and is impossible if
h(p1 , p2 , . . . , pm ) + h(p , 1 p ) > 1,
m
120
Solution The asymptotic equipartition property for a Bernoulli source states that
the number of distinct strings (words) of length n emitted by the source is typically 2nH+o(n) , and they have nearly equal probabilities 2nH+o(n) :
lim P 2n(H+ ) Pn (U(n) ) 2n(H ) = 1.
n
Here, H = h(p1 , . . . , pn ).
Denote
(
)
Tn (= Tn ( )) = u(n) : 2n(H+ ) Pn (u(n) ) 2n(H )
and observe that
1
1
log Tn = H, i.e. lim sup log Tn < H + .
n n
n n
lim
By the definition of the channel capacity, the words u(n) Tn ( ) may be encoded
1
by binary codewords of length R (H + ) and sent reliably through a memoryless
symmetric channel with matrix
p
1 p
p
1 p
for any R < C where
C = sup I(X : Y ) = sup[h(Y ) h(Y |X)].
pX
pX
121
independently of pX . Hence,
C = sup h(Y ) h(p , 1 p ) = 1 h(p , 1 p ),
pX
then R (H + ) can be made < 1, for > 0 small enough and R < C close to
C. This means that there exists a sequence of codes fn of length n such that the
error-probability, while using encoding fn and the ML decoder, is
P u(n) Tn
+P u(n) Tn ; an error while using fn (u(n) ) and the ML decoder
0, as n ,
since both probabilities go to 0.
On the other hand,
H >C
then R H > 1 for all R < C, and we cannot encode words u(n) Tn by codewords
of length n so that the error-probability tends to 0. Hence, no reliable transmission
is possible.
Problem 1.19 A Markov source with an alphabet of m characters has a transition
matrix Pm whose elements p jk are specified by
p11 = pmm = 2/3, p j j = 1/3 (1 < j < m),
p j j+1 = 1/3 (1 j < m), p j j1 = 1/3 (1 < j m).
All other elements are zero. Determine the information rate of the source.
. Consider
Denote the transition matrix thus specified by Pm
a source in an
Pm 0
, where the zeros
alphabet of m + n characters whose transition matrix is
0 Pn
indicate zero matrices of appropriate size. The initial character is supposed uniformly distributed over the alphabet. What is the information rate of the source?
122
2/3 1/3 0
0 ... 0
1/3 1/3 1/3 0 . . . 0
.
.
.
.
.
..
..
..
..
..
..
0
0
0
0 . . . 2/3
is Hermitian and so has the equilibrium distribution = (i ) with i = 1/m, 1
i m (equidistribution). The information rate equals
Hm = j p jk log p jk
j,k
2
2 1
1
1
1
1
2
log + log
+ 3(m 2) log
=
m
3
3 3
3
3
3
4
.
= log 3
3m
Pm 0
The source with transition matrix
is non-ergodic, and its information
0 Pn
rate is the maximum of the two rates
max Hm , Hn = Hmn .
Problem 1.20
Consider a source in a finite alphabet. Define Jn = n1 h(U(n) )
and Kn = h(Un+1 |U(n) ) for n = 1, 2, . . .. Here Un is the nth symbol in the sequence
and U(n) is the string constituted by the first n symbols, h(U(n) ) is the entropy and
h(Un+1 |U(n) ) the conditional entropy. Show that, if the source is stationary, then Jn
and Kn are non-increasing and have a common limit.
Suppose the source is Markov and not necessarily stationary. Show that the
mutual information between U1 and U2 is not smaller than that between U1 and U3 .
Solution For the second part, the Markov property implies that
P(U1 = u1 |U2 = u2 ,U3 = u3 ) = P(U1 = u1 |U2 = u2 ).
Hence,
P(U1 = u1 |U2 = u2 ,U3 = u3 )
= E log
= I(U1 : U2 ).
P(U1 = u1 )
Since
I(U1 : (U2 ,U3 )) I(U1 : U3 ),
the result follows.
123
Problem 1.21 Construct a Huffman code for a set of 5 messages with probabilities as indicated below
1
2
3
4
5
Message
Probability 0.1 0.15 0.2 0.26 0.29
Solution
Message
1
2
3
4
5
Probability 0.1 0.15 0.2 0.026 0.029
Codeword 101 100 11 01
00
The expected codeword-length equals 2.4.
Problem 1.22
State the first coding theorem (FCT), which evaluates the information rate for a source with suitable long-run properties. Give an interpretation of
the FCT as an asymptotic equipartition property. What is the information rate for a
Bernoulli source?
Consider a Bernoulli source that emits symbols 0, 1 with probabilities 1 p and
p respectively, where 0 < p < 1. Let (p) = p log p (1 p) log(1 p) and let
> 0 be fixed. Let U(n) be the string consisting of the first n symbols emitted by
the source. Prove that there is a set Sn of possible values of U(n) such that
(n)
P U Sn 1 log
2
p(1 p)
,
n 2
Sn the probability that P U(n) = u(n) lies between
p
1 p
1
Var [log P(U1 )] .
2n
(1.6.23)
124
Here
'
P(U j ) =
and
1 p,
if U j = 0,
p,
if U j = 1,
log P(U j )
1 jn
where
P(U j ),
1 jn
Var
Pn (U(n) ) =
Var log P(U j )
1 jn
2
Var log P(U j ) = E log P(U j ) E log P(U j )
2
= p(log p)2 + (1 p)(log(1 p))2 p log p + (1 p) log(1 p)
2
p
= p(1 p) log
.
1 p
S
value of q . Establish the lower bound E q
pi , and characterise
1im
pi .
q
1im
xi yi
1im
1im
xi2
1im
y2i
125
Solution By CauchySchwarz,
pi = pi qsi /2 qsi /2
1im
1/2
1/2
1/2
s
s
s
i
i
i
,
q
pi q
pi q
1/2
1/2
1im
1im
since, by Kraft,
1im
1im
qsi 1. Hence,
1im
2
EqS =
pi qsi
1im
1/2
pi
1im
qxi = 1 (so,
1im
1/2
pi
1im
1im
EqS =
pi qsi = q
1im
<q
1im
pi qsi 1
1im
pi qxi = qc
1im
2
1/2
pi
= q
1/2
pi
1im
Problem 1.24
A Bernoulli source of information of rate H is fed characterby-character into a transmission line which may be live or dead. If the line is live
when a character is transmitted then that character is received faithfully; if the line
is dead then the receiver learnt only that it is indeed dead. In shifting between
its two states the line follows a Markov chain (DTMC) with constant transition
probabilities, independent of the text being transmitted.
Show that the information rate of the source constituted by the received signal
is HL + L HS where HS is the signal, HL is the information rate of the DTMC
governing the functioning of the line and L is the equilibrium probability that the
line is alive.
Solution The rate of a Bernoulli source emitting letter j = 1, 2, . . . with probability
p j is H = p j log p j . The state of the line is a DTMC with a 2 2 transition
j
126
matrix
dead
live
1
1
, L (live) =
+
+
(assuming that + > 0). The received signal sequence follows a DTMC with
states 0 (dead), 1, 2, . . . and transition probabilities
q0 j = p j ,
q00 = 1 ,
j, k 1.
q jk = (1 )pk
q j0 = ,
This chain has a unique equilibrium distribution
RS (0) =
, RS ( j) =
p j , j 1.
+
+
(1 ) log(1 ) + p j log( p j )
=
+
j1
= HL +
HS .
+
Here HL is the entropy rate of the line state DTMC:
HL =
(1 ) log(1 ) + log
+
(1 ) log(1 ) + log ,
and = /( + ).
Problem 1.25
Consider a Bernoulli source in which the individual character
can take value i with probability pi (i = 1, . . . , m). Let ni be the number of times the
character value i appears in the sequence u(n) = u1 u2 . . . un of given length n. Let
An be the smallest set of sequences u(n) which has total probability at least 1 .
Show that each sequence in An satisfies the inequality
ni log pi nh + nk/ )1/2 ,
127
pni i .
1im
'
@
u(n) : ni log pi c ,
An =
(n)
< .
ni log pi c
1im
Now, for the random string U(n) = U1 . . . Un , let Ni is the number of appearances
of value i. Then
Ni log pi =
1im
1 jn
pi log pi := h
1im
and
2
Var j = E( j )2 E j =
pi (log pi )2
1im
Then
E
1 jn
2
j = nh and Var
pi log pi
1im
j = nv.
1 jn
:= v.
128
ni log pi nh +
1im
nk
:= c.
For an irreducible and aperiodic Markov source the assertion is similar, with
H =
i pi j log pi j ,
1i, jm
1
and v 0 a constant given by v = lim sup Var
n n
j .
1 jn
Problem 1.26
Demonstrate that an efficient and decipherable noiseless coding
procedure leads to an entropy as a measure of attainable performance.
Words of length si (i = 1, . . . , n) in an alphabet Fa = {0, 1, . . . , a 1} are to be
n
i=1
ity but also to the condition that qi si should not exceed a prescribed bound,
i=1
pi si .
i=1
Solution If we disregard the condition that s1 , . . . , sn are positive integers, the minimisation problem becomes
minimise
si pi
i
1in
si pi
1in
asi .
(1.6.24)
129
(1.6.25)
vrel =
pi loga pi := h,
1in
h si pi .
i
qi si b.
(1.6.26)
1in
The relaxed problem (1.6.24) complemented with (1.6.26) again can be solved by
the Lagrange method. Here, if
qi loga pi b
i
then adding the new constraint does not affect the minimiser (1.6.24), i.e. the
optimal positive s1 , . . . , sn are again given by (1.6.25), and the optimal value is h.
Otherwise, i.e. when qi loga pi > b, the new minimiser s1 , . . . , sn is still unique
i
(since the problem is still strong Lagrangian) and fulfils both constraints
as
= 1,
qi si = b.
i
In both cases, the optimal value vrel for the new relaxed problem satisfies h vrel .
Finally, the solution s1 , . . . , sn to the integer-valued word-length problem
minimise
si pi
i
subject to si 1 integer
and asi 1, qi si b
i
(1.6.27)
will satisfy
h vrel si pi ,
i
Problem 1.27
si qi b.
i
130
(Here the source is a DTMC (X_t) with transition probabilities p_{ij} and s-step transition probabilities p^{(s)}_{ij}, and each emitted digit is, independently, obliterated — replaced by a splodge — with probability θ.) The probability p_n(x_1^n) = P(X_1^n = x_1^n) of such a string is represented as a product of factors,                                                  (1.6.28)
the first factor involving P(X_{s_1} = x_{s_1}) or P(X_1 = x_1), depending on where the initial non-obliterated digit occurred in x_1^n (if at all). The subsequent factors contributing to (1.6.28) have a similar structure:
    p_{x_{t−1} x_t},   or   p^{(s)}_{x_{t−s} x_t} θ^{s−1},   or   1.
Consequently, the information −log p_n(x_1^n) carried by string x_1^n is calculated as
    −log P(X_{s_1} = x_{s_1}) − (s_1 − 1) log θ − log(1 − θ)
    − log p^{(s_2 − s_1)}_{x_{s_1} x_{s_2}} − (s_2 − s_1 − 1) log θ − log(1 − θ)
    − ⋯
    − log p^{(s_N − s_{N−1})}_{x_{s_{N−1}} x_{s_N}} − (s_N − s_{N−1} − 1) log θ − log(1 − θ) − ⋯,
where 1 ≤ s_1 < ⋯ < s_N ≤ n are the consecutive times of appearance of non-obliterated symbols in x_1^n.
Now take −(1/n) log p_n(X_1^n), the information rate provided by the random string X_1^n. Ignoring the initial bit, we can write
    −(1/n) log p_n(X_1^n) = −(N(∗)/n) log(1 − θ) − (N(⊔)/n) log θ − (1/n) ∑_{i,j,s} M(i, j; s) log p^{(s)}_{ij}.
Here
    N(∗) = number of non-obliterated digits in X_1^n,
    N(⊔) = number of obliterated digits in X_1^n,
    M(i, j; s) = number of series of digits i ⊔ … ⊔ j in X_1^n of length s + 1.
As n → ∞, we have the convergence of the limiting frequencies (the law of large numbers applies):
    N(∗)/n → 1 − θ,   N(⊔)/n → θ,   M(i, j; s)/n → (1 − θ)² θ^{s−1} π_i p^{(s)}_{ij}.
This yields
    −(1/n) log p_n(X_1^n) → −(1 − θ) log(1 − θ) − θ log θ − (1 − θ)² ∑_{i,j} ∑_{s≥1} θ^{s−1} π_i p^{(s)}_{ij} log p^{(s)}_{ij},
where π = (π_i) is the equilibrium distribution of the chain.
    00 (  q_0  q_1   0    0  )
    01 (   0    0   q_1  q_0 )
    10 (  q_1  q_0   0    0  )
    11 (   0    0   q_0  q_1 )
Since the equilibrium distribution is uniform, the information rate equals
    −(1/4) ∑_{rows} ∑_{α,β=0,1} p_{α,β} log p_{α,β}
and equals
    h(q_0, q_1) = −q_0 log q_0 − q_1 log q_1.
Problem 1.29
An input to a discrete memoryless channel has three letters 1, 2 and 3. The letter j is received as (j − 1) with probability p, as (j + 1) with probability p and as j with probability 1 − 2p, the letters from the output alphabet ranging from 0 to 4. Determine the form of the optimal input distribution, for general p, as explicitly as possible. Compute the channel capacity in the three cases p = 0, p = 1/3 and p = 1/2.
Solution  The channel matrix is 3 × 5:
    1 (  p   1−2p    p     0    0 )
    2 (  0     p    1−2p    p   0 )
    3 (  0     0     p    1−2p  p ).
The rows are permutations of each other, so the capacity equals
    C = max_{P_X} [ h(Y) − h(Y|X) ],
where h(Y|X) does not depend on P_X, h(Y) = −∑_{y=0,1,2,3,4} P_Y(y) log P_Y(y), and
    P_Y(0) = P_X(1) p,
    P_Y(1) = P_X(1)(1 − 2p) + P_X(2) p,
    P_Y(2) = P_X(1) p + P_X(2)(1 − 2p) + P_X(3) p,
    P_Y(3) = P_X(3)(1 − 2p) + P_X(2) p,
    P_Y(4) = P_X(3) p.                                                       (1.6.29)
The symmetry in (1.6.29) suggests that h(Y) is maximised when P_X(1) = P_X(3) = q and P_X(2) = 1 − 2q. So:
    (d/dq) h(Y) = −2p log(qp) − 2p − 2(1 − 4p) log( q(1 − 2p) + (1 − 2q)p ) − ⋯,
    ⋯,   P_Y(1) = P_Y(3) = 1/3,   P_Y(2) = 1/6.
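The optimal q can also be located numerically. A minimal sketch (it assumes, as in the solution, the symmetric input distribution (q, 1 − 2q, q) and simply scans over q; the value p = 1/3 is used only as an example):

```python
import numpy as np

def mutual_info(q, p):
    """I(X;Y) in bits for P_X = (q, 1-2q, q) on the channel of Problem 1.29."""
    P = np.array([[p, 1 - 2 * p, p, 0, 0],
                  [0, p, 1 - 2 * p, p, 0],
                  [0, 0, p, 1 - 2 * p, p]])
    px = np.array([q, 1 - 2 * q, q])
    py = px @ P
    def H(v):
        v = v[v > 0]
        return -np.sum(v * np.log2(v))
    return H(py) - np.sum(px * np.array([H(row) for row in P]))

p = 1 / 3
qs = np.linspace(1e-6, 0.5 - 1e-6, 2001)
vals = [mutual_info(q, p) for q in qs]
i = int(np.argmax(vals))
print(f"p={p:.3f}: best q ~ {qs[i]:.4f}, capacity ~ {vals[i]:.4f} bits")
```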
For this channel the matrix is
    (  1    0   0   …   0 )
    ( 1−p   p   0   …   0 )
    ( 1−p   0   p   …   0 )
    (  ⋮    ⋮   ⋮   ⋱   ⋮ ).
From ∑_{i≥1} a_i = c ∑_{i≥1} d^{i−1} = c/(1 − d) = 1 − a_0 and d = a_0, we get c = (1 − a_0)².
Next, we maximise, in a_0 ∈ [0, 1], the function
    f(a_0) = −(1 − p + p a_0) log(1 − p + p a_0) − p(1 − a_0) log[ (1 − a_0)²/a_0 ]
             − p log a_0 − p(1 − a_0) log p + (1 − a_0)[ (1 − p) log(1 − p) + p log p ].
Requiring that
    f′(a_0) = 0                                                               (1.6.30a)
and
    f″(a_0) = −p²/(1 − p + p a_0) − 2p/(1 − a_0) − p/a_0 ≤ 0,                 (1.6.30b)
one can solve equation (1.6.30a) numerically. Denote its root where (1.6.30b) holds by a_0*. Then we obtain the following answer for the optimal input distribution:
    a_i = a_0* for i = 0,   a_i = (1 − a_0*)² (a_0*)^{i−1} for i ≥ 1,
with the capacity C = f(a_0*).
Problem 1.31 The representation known as binary-coded decimal encodes 0 as
0000, 1 as 0001 and so on up to 9, coded as 1001, with other 4-digit binary strings
being discarded. Show that by encoding in blocks, one can get arbitrarily near the
lower bound on word-length per decimal digit.
Hint: Assume all integers to be equally probable.
Solution  Assume the decimal digits i_1, …, i_n of a block are IID and equidistributed. Denote by S(n) the random codeword-length while encoding in blocks of n digits. The minimal expected word-length per source letter is e_n := (1/n) min E S(n). By Shannon's noiseless coding (NC) theorem,
    h^(n)/(n log q) ≤ e_n ≤ h^(n)/(n log q) + 1/n,
where q is the size of the encoding alphabet and h^(n) the entropy of the block. We see that, for large n, e_n ≈ h^(n)/(n log q). With q = 2 and h^(n) = n log 10 this gives e_n → log 10/log 2 = log₂ 10 ≈ 3.32 bits per decimal digit, the lower bound on word-length per decimal digit.
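Concretely, assuming equiprobable decimal digits as in the hint, encoding blocks of n digits into ⌈log₂ 10ⁿ⌉ binary digits brings the word-length per decimal digit down from 4 (plain BCD) towards the lower bound log₂ 10 ≈ 3.3219, as the short sketch below illustrates.

```python
import math

lower_bound = math.log2(10)        # bits per decimal digit for equiprobable digits
print(f"plain BCD: 4.0000 bits/digit, lower bound: {lower_bound:.4f}")

for n in (1, 2, 3, 5, 10, 20, 50):
    bits = math.ceil(n * math.log2(10))   # encode a block of n digits as one binary word
    print(f"n = {n:3d}: {bits / n:.4f} bits per decimal digit")
```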
Relate this to the information rate of a two-state source with transition probabilities p and 1 − p.
Solution  The information rate of an m-state stationary DTMC with transition matrix P = (p_{ij}) and an equilibrium (invariant) distribution π = (π_i) equals
    h = −∑_{i,j} π_i p_{ij} log p_{ij}.
If matrix P is irreducible (i.e. has a unique communicating class) then this statement holds for the chain with any initial distribution (in this case the equilibrium
distribution is unique).
The transition matrix in question is the m × m circulant
    (  p   1−p   0   …   0  )
    (  0    p   1−p  …   0  )
    (  ⋮         ⋱    ⋱   ⋮  )
    (  0    0    …   p  1−p )
    ( 1−p   0    …   0   p  ).
The rows are permutations of each other, and each of them has entropy
    −p log p − (1 − p) log(1 − p).
The equilibrium distribution is π = (1/m, …, 1/m):
    ∑_{1≤i≤m} (1/m) p_{ij} = (1/m)(p + 1 − p) = 1/m,
and it is unique, as the chain has a unique communicating class. Therefore, the information rate equals
    h = −(1/m) ∑_{1≤i≤m} [ p log p + (1 − p) log(1 − p) ] = −p log p − (1 − p) log(1 − p).
For m = 2 we obtain precisely the matrix
    (  p   1−p )
    ( 1−p   p  ),
so with the equilibrium distribution π = (1/2, 1/2) the information rate is again h = −p log p − (1 − p) log(1 − p).
Problem 1.33 Define a symmetric channel and find its capacity.
A native American warrior sends smoke signals. The signal is coded in puffs of
smoke of different lengths: short, medium and long. One puff is sent per unit time.
Assume a puff is observed correctly with probability p, and with probability 1 − p
(a) a short signal appears to be medium to the recipient, (b) a medium puff appears
to be long, and (c) a long puff appears to be short. What is the maximum rate at
which the warrior can transmit reliably, assuming the recipient knows the encoding
system he uses?
It would be more reasonable to assume that a short puff may disperse completely
rather than appear medium. In what way would this affect your derivation of a
formula for channel capacity?
Solution Suppose we use an input alphabet I , of m letters, to feed a memoryless
channel that produces symbols from an output alphabet J of size n (including
illegibles). The channel is described by its m n matrix where entry pi j gives the
probability of receiving output letter j ∈ J when input letter i ∈ I is sent:
    ( p_11  …  p_1j  …  p_1n )
    (  ⋮        ⋮        ⋮   )
    ( p_i1  …  p_ij  …  p_in )
    (  ⋮        ⋮        ⋮   )
    ( p_m1  …  p_mj  …  p_mn ).
The channel is called symmetric if its rows are permutations of each other (or, more generally, have the same entropy E = h(p_i1, …, p_in) for all i ∈ I). The channel is said to be double-symmetric if in addition its columns are permutations of each other (or, more generally, have the same column sum ∑_{1≤i≤m} p_ij for all j ∈ J).
The capacity of such a channel equals C = max I(X : Y). Here, the maximum is taken over P_X = (P_X(i), i ∈ I), the input-letter probability distribution, and I(X : Y) is the mutual entropy between the input and output random letters X and Y tied through the channel matrix:
    I(X : Y) = h(Y) − h(Y|X) = h(X) − h(X|Y).
For the symmetric channel, the conditional entropy
    h(Y|X) = −∑_{i,j} P_X(i) p_{ij} log p_{ij} ≡ h
does not depend on P_X, and the maximisation needs only to be performed for the output symbol entropy
    h(Y) = −∑_j P_Y(j) log P_Y(j),  where P_Y(j) = ∑_i P_X(i) p_{ij}.
For the smoke-signal channel the matrix is 3 × 3:
    1 short   (  p    1−p   0  )
    2 medium  (  0     p   1−p )
    3 long    ( 1−p    0    p  ),
which is both symmetric and double-symmetric. This yields
    C = log 3 + p log p + (1 − p) log(1 − p).
In the modified example, the matrix becomes 3 × 4:
    (  p    0    0   1−p )
    (  0    p   1−p   0  )
    ( 1−p   0    p    0  );
column 4 corresponds to a no-signal output state (a splodge). The maximisation problem loses its symmetry:
    maximise  −∑_{j=1,…,4} ( ∑_{i=1,2,3} P_X(i) p_{ij} ) log ( ∑_{i=1,2,3} P_X(i) p_{ij} )
              + ∑_{i=1,2,3} P_X(i) ∑_{j=1,…,4} p_{ij} log p_{ij},
    subject to ∑_{i=1,2,3} P_X(i) = 1.                                        (1.6.31)
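For a numerical check (the value p = 0.9 below is an arbitrary illustration): the double-symmetric formula gives the capacity of the cyclic channel directly, while the modified 3 × 4 channel can be handled by maximising the mutual information over the input simplex, here by a crude grid search.

```python
import numpy as np

def I(px, P):
    """Mutual information (bits) between input with distribution px and the channel P."""
    py = px @ P
    def H(v):
        v = v[v > 0]
        return -np.sum(v * np.log2(v))
    return H(py) - np.sum(px * np.array([H(r) for r in P]))

p = 0.9
cyclic = np.array([[p, 1 - p, 0], [0, p, 1 - p], [1 - p, 0, p]])
print("cyclic:", np.log2(3) + p * np.log2(p) + (1 - p) * np.log2(1 - p),
      "=", I(np.ones(3) / 3, cyclic))

modified = np.array([[p, 0, 0, 1 - p], [0, p, 1 - p, 0], [1 - p, 0, p, 0]])
best = 0.0
for a in np.linspace(0, 1, 101):            # crude search over P_X = (a, b, 1-a-b)
    for b in np.linspace(0, 1 - a, 101):
        best = max(best, I(np.array([a, b, 1 - a - b]), modified))
print("modified channel capacity ~", round(best, 4), "bits")
```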
with equality iff X and Y are Gaussian with proportional covariance matrices.
Let X be a real-valued random variable with a PDF f_X and finite differential entropy h(X), and let the function g : R → R have a strictly positive derivative g′ everywhere. Prove that the random variable g(X) has differential entropy satisfying
    h(g(X)) = h(X) + E log₂ g′(X).
Solution  Since g is strictly increasing, F_{g(X)}(y) = P(g(X) ≤ y) = F_X(g^{−1}(y)), and the PDF f_{g(X)}(y) = dF_{g(X)}(y)/dy takes the form
    f_{g(X)}(y) = f_X(g^{−1}(y)) (g^{−1})′(y) = f_X(g^{−1}(y)) / g′(g^{−1}(y)).
Then
    h(g(X)) = −∫ f_{g(X)}(y) log₂ f_{g(X)}(y) dy
            = −∫ f_X(x) [ log₂ f_X(x) − log₂ g′(x) ] dx
            = h(X) + E log₂ g′(X).                                            (1.6.32)
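A quick sanity check of (1.6.32) using closed forms (the choice X ~ N(0, 1) and g(x) = eˣ is ours, purely for illustration): then g(X) is lognormal, E log₂ g′(X) = E X / ln 2 = 0, and indeed h(g(X)) = h(X).

```python
import math

# X ~ N(0,1); g(x) = exp(x), so g'(x) = exp(x) and E[log2 g'(X)] = E[X]/ln 2 = 0.
h_X = 0.5 * math.log2(2 * math.pi * math.e)          # differential entropy of N(0,1)

# g(X) is LogNormal(0,1); its differential entropy in bits:
mu, sigma = 0.0, 1.0
h_gX = math.log2(sigma * math.sqrt(2 * math.pi)) + (mu + 0.5) / math.log(2)

E_log2_gprime = mu / math.log(2)                     # E[X]/ln 2
print(h_gX, "=", h_X + E_log2_gprime)                # both ~ 2.0471 bits
```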
Problem 1.35  For 0 < a < b introduce the geometric, logarithmic and identric means
    G(a, b) = √(ab),   L(a, b) = (b − a)/log(b/a),   I(a, b) = (1/e) (b^b/a^a)^{1/(b−a)}.
Check that
    0 < a < G(a, b) < L(a, b) < I(a, b) < A(a, b) = (a + b)/2 < b.            (1.6.33)
Let m = min_i [q_i/p_i], M = max_i [q_i/p_i], ρ = min_i [p_i], r = max_i [p_i]. Prove the following bounds for the entropy h(X) and the Kullback–Leibler divergence D(p‖q) (cf. PSE II, p. 419):
0 log r h(X) log ( , ).
(1.6.34)
(1.6.35)
pi xi
r r
Next, we sketch the proof of (1.6.36); see details in [144], [50]. Let f be a convex function and p, q ≥ 0 with p + q = 1. Then for x_i ∈ [a, b] we have
    0 ≤ ∑_i p_i f(x_i) − f( ∑_i p_i x_i ) ≤ max_p [ p f(a) + q f(b) − f(pa + qb) ].        (1.6.37)
Applying (1.6.37) to the convex function f(x) = −log x, we obtain after some calculations that the maximum in (1.6.37) is achieved at p_0 = (b − L(a, b))/(b − a), with p_0 a + (1 − p_0) b = L(a, b), and
    0 ≤ log [ A(p, x)/G(p, x) ] ≤ log[ (b − a)/log(b/a) ] − log(ab) + [ log(b^b/a^a) ]/(b − a) − 1,
which is equivalent to (1.6.36). Finally, we establish (1.6.37). Write x_i = λ_i a + (1 − λ_i) b for some λ_i ∈ [0, 1]. Then, by convexity,
    0 ≤ ∑_i p_i f(x_i) − f( ∑_i p_i x_i )
      ≤ ∑_i p_i ( λ_i f(a) + (1 − λ_i) f(b) ) − f( a ∑_i p_i λ_i + b ∑_i p_i (1 − λ_i) ).
Denoting ∑_i p_i λ_i = p and 1 − ∑_i p_i λ_i = q and maximising over p, we obtain (1.6.37).
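The chain (1.6.33) is easy to test numerically; here is a short check on a few arbitrary pairs (a, b):

```python
import math

def means(a, b):
    G = math.sqrt(a * b)
    L = (b - a) / math.log(b / a)
    I = (b ** b / a ** a) ** (1.0 / (b - a)) / math.e
    A = (a + b) / 2
    return G, L, I, A

for a, b in [(1.0, 2.0), (0.3, 7.5), (2.0, 2.0001)]:
    G, L, I, A = means(a, b)
    assert a < G < L < I < A < b, (a, b)     # the chain (1.6.33)
    print(f"a={a}, b={b}: G={G:.5f} < L={L:.5f} < I={I:.5f} < A={A:.5f}")
```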
Problem 1.36 Let f be a strictly positive probability density function (PDF)
on the line R, define the KullbackLeibler divergence D(g|| f ) and prove that
D(g‖f) ≥ 0. Suppose in addition that ∫ eˣ f(x) dx < ∞. Show that, for any PDF g with ∫ |x| g(x) dx < ∞,
    −∫ x g(x) dx + D(g‖f) ≥ −ln Z,  where Z = ∫ e^z f(z) dz,                  (1.6.38)
with equality for g*(x) = eˣ f(x)/Z.

Solution  By definition,
    D(g‖f) = ∫ g(x) ln [ g(x)/f(x) ] dx,
and D(g‖f) ≥ 0 by the Gibbs inequality: since ln t ≤ t − 1,
    −D(g‖f) = ∫ g(x) ln [ f(x)/g(x) ] dx ≤ ∫ g(x) [ f(x)/g(x) − 1 ] dx = 0.
Now set
    g*(x) = eˣ f(x)/Z   and   W = ∫ x eˣ f(x) dx.
Then
    D(g*‖f) = (1/Z) ∫ eˣ f(x) ln (eˣ/Z) dx = (1/Z) ∫ eˣ f(x) (x − ln Z) dx = W/Z − ln Z,
and, since ∫ x g*(x) dx = W/Z,
    −∫ x g*(x) dx + D(g*‖f) = −ln Z.
For a general PDF g,
    0 ≤ D(g‖g*) = ∫ g(x) ln [ g(x) Z e^{−x}/f(x) ] dx = −∫ x g(x) dx + D(g‖f) + ln Z,
implying that
    −∫ x g(x) dx + D(g‖f) ≥ −ln Z = −∫ x g*(x) dx + D(g*‖f).
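As a numerical illustration of the last display (the particular densities below are our choice): take f to be the N(0, 1) density, so that Z = e^{1/2} and g* is the N(1, 1) density; for g = N(m, 1) the functional −∫ x g dx + D(g‖f) equals m²/2 − m, minimised at m = 1 with value −1/2 = −ln Z.

```python
# f = N(0,1) density; Z = integral of e^x f(x) dx = e^{1/2}; g* = e^x f / Z = N(1,1).
# For g = N(m,1): D(g||f) = m^2/2 (nats) and -int x g(x) dx = -m.
for m in (0.0, 0.5, 1.0, 2.0):
    value = m ** 2 / 2 - m
    print(f"g = N({m},1): -E_g[X] + D(g||f) = {value:+.3f}  (>= -ln Z = -0.5)")
```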
2
Introduction to Coding Theory

2.1 Hamming spaces. Geometry of codes. Basic bounds on the code size

The Hamming distance between words x^(N) = x_1 … x_N and y^(N) = y_1 … y_N over a q-ary alphabet is
    δ(x^(N), y^(N)) = the number of digits i with x_i ≠ y_i.                 (2.1.1a)

Figure 2.1  (panels N = 1, 2, 3, 4)
An important part is played by the distance (x(N) , 0(N) ) between words x(N) =
x1. . . xN and 0(N) = 0 . . . 0; it is called the weight of word x(N) and denoted by
w x(N) :
    w(x^(N)) = the number of digits i with x_i ≠ 0.                          (2.1.1b)
Lemma 2.1.1
(2.1.2)
(2.1.3a)
This makes the Hamming space HN,q a commutative group, with the zero codeword 0(N) = 0 . . . 0 playing the role of the zero of the group. (Words also may be
multiplied which generates a powerful apparatus; see below.)
For q = 2, we have a two-point code alphabet {0, 1} that is actually a two-point field, F_2, with the following arithmetic: 0 + 0 = 1 + 1 = 0 · 1 = 1 · 0 = 0, 0 + 1 = 1 + 0 = 1 · 1 = 1. (Recall, a field is a set equipped with two commutative operations, addition and multiplication, satisfying standard axioms of associativity and distributivity.) Thus, each point in the binary Hamming space H_{N,2} is opposite to itself: x^(N) + x′^(N) = 0^(N) iff x^(N) = x′^(N). In fact, H_{N,2} is a linear space over the coefficient field F_2, with 1 · x^(N) = x^(N), 0 · x^(N) = 0^(N).
Henceforth, all additions of q-ary words are understood digit-wise and mod q.
Lemma 2.1.2  The Hamming distance on H_{N,q} is invariant under group translations:
    δ(x^(N) + z^(N), y^(N) + z^(N)) = δ(x^(N), y^(N)).                        (2.1.3b)
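The definitions above translate into a few lines of code; the sketch below (added for illustration) also checks the translation invariance (2.1.3b) on a small binary example.

```python
def hamming_distance(x, y):
    """Number of digit positions where the q-ary words x and y differ."""
    assert len(x) == len(y)
    return sum(1 for a, b in zip(x, y) if a != b)

def weight(x):
    """Weight w(x): distance between x and the zero word."""
    return sum(1 for a in x if a != 0)

# translation invariance (2.1.3b) over F_2, checked on a small example
x, y, z = (1, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 1)
xz = tuple((a + c) % 2 for a, c in zip(x, z))
yz = tuple((b + c) % 2 for b, c in zip(y, z))
assert hamming_distance(xz, yz) == hamming_distance(x, y) \
       == weight(tuple((a + b) % 2 for a, b in zip(x, y)))
print(hamming_distance(x, y), weight(x))
```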
if it allows the receiver to correct the error string e(N) , at least when word e(N)
does not contain too many non-zero digits.
Going back to an MBSC with the row probability of error p < 1/2: the ML decoder selects a codeword x*^(N) that leads to a word e^(N) with a minimal number of unit digits. In geometric terms:
    x*^(N) = a codeword closest to the received word in the Hamming distance.        (2.1.4)
The same rule can be applied in the q-ary case: we look for the codeword closest
to the received string. A drawback of this rule is that if several codewords have the
same minimal distance from a received word we are stuck. In this case we either
choose one of these codewords arbitrarily (possibly randomly or in connection with
the messages content; this is related to the so-called list decoding), or, when a high
quality of transmission is required, refuse to decode the received word and demand
a re-transmission.
Definition 2.1.3  We call N the length of a binary code X_N, M := ♯X_N the size and ρ := (log₂ M)/N the information rate. A code X_N is said to be D-error detecting if making up to D changes in any codeword does not produce another codeword, and E-error correcting if making up to E changes in any codeword x^(N) produces a word which is still (strictly) closer to x^(N) than to any other codeword (that is, x^(N) is correctly guessed from a distorted word under the rule (2.1.4)). A code has minimal distance (or briefly distance) d if
    d = min { δ(x^(N), x′^(N)) : x^(N), x′^(N) ∈ X_N, x^(N) ≠ x′^(N) }.       (2.1.5)
The minimal distance and the information rate of a code X_N will be sometimes denoted by d(X_N) and ρ(X_N), respectively.
This definition can be repeated almost verbatim for the general case of a q-ary code X_N ⊆ H_{N,q}, with information rate ρ = (log_q M)/N. Namely, a code X_N is called E-error correcting if, for all r = 1, …, E, x^(N) ∈ X_N and y^(N) ∈ H_{N,q} with δ(x^(N), y^(N)) = r, the distance δ(y^(N), x′^(N)) > r for all x′^(N) ∈ X_N such that x′^(N) ≠ x^(N). In words, it means that making up to E errors in a codeword produces a word that is still closer to it than to any other codeword. Geometrically, this property means that the balls of radius E about the codewords do not intersect:
    B_{N,q}(x^(N), E) ∩ B_{N,q}(x′^(N), E) = ∅  for all distinct x^(N), x′^(N) ∈ X_N.
Next, a code X_N is called D-error detecting if the ball of radius D about a codeword does not contain another codeword. Equivalently, the intersection B_{N,q}(x^(N), D) ∩ X_N is reduced to the single point x^(N).
A code X_N ⊆ H_{N,q} is called linear if
    x^(N) + y^(N) ∈ X_N  for all  x^(N), y^(N) ∈ X_N                          (2.1.6a)
and λ x^(N) ∈ X_N for all λ ∈ F_q. For a linear code X, the size M is given by
M = qk where k may take values 1, . . . , N and gives the dimension of the code, i.e.
the maximal number of linearly independent codewords. Accordingly, one writes
k = dim X . As in the usual geometry, if k = dim X then in X there exists a basis
of size k, i.e. a linearly independent collection of codewords x(1) , . . . , x(k) such that
any codeword x ∈ X can be (uniquely) written as a linear combination ∑_{1≤j≤k} a_j x^(j), with coordinates
    x_i = ∑_{1≤j≤k} a_j x_i^(j),   i = 1, …, N.                              (2.1.6b)
The volume of the ball of radius R in H_{N,q} does not depend on the choice of its centre x^(N) and equals
    v_{N,q}(R) = ♯ B_{N,q}(x^(N), R) = ∑_{0≤k≤R} \binom{N}{k} (q − 1)^k;      (2.1.7)
the detecting and correcting abilities). From this point of view, it is important to
understand basic bounds for codes.
Upper bounds are usually written for M_q(N, d), the largest size of a q-ary code of length N and distance d. We begin with elementary facts: M_q(N, 1) = q^N, M_q(N, N) = q, M_q(N, d) ≤ q M_q(N − 1, d) and, in the binary case, M_2(N, 2s) = M_2(N − 1, 2s − 1) (easy exercises).
Indeed, the number of codewords cannot be too large if we want to keep a good error-detecting and error-correcting ability. There are various bounds for
parameters of codes; the simplest bound was discovered by Hamming in the late
1940s.
Theorem 2.1.6  (The Hamming bound)
(i) If X_N is an E-error correcting q-ary code of size M then
    M v_{N,q}(E) ≤ q^N.                                                       (2.1.8a)
(ii) If X_N is a q-ary [N, M, d] code then
    M v_{N,q}(⌊(d − 1)/2⌋) ≤ q^N.                                             (2.1.8b)
Proof (i) The E-balls about the codewords x(N) XN must be disjoint. Hence,
the total number of points covered equals the product vN,q (E)M which should not
exceed qN , the cardinality of the Hamming space HN,q .
(ii) Likewise, if X_N is an [N, M, d] code then, as was noted above, for E = ⌊(d − 1)/2⌋ the balls B_{N,q}(x^(N), E), x^(N) ∈ X_N, do not intersect. The volume ♯B_{N,q}(x^(N), E) is given by
    v_{N,q}(E) = ∑_{0≤k≤E} \binom{N}{k} (q − 1)^k,
and the union of balls ∪_{x^(N)∈X_N} B_{N,q}(x^(N), E) has cardinality M v_{N,q}(E) ≤ q^N.
partition has an additional advantage: the code not only corrects errors, but never
leads to a refusal of decoding. More precisely:
Definition 2.1.7  An E-error correcting code X_N of size ♯X_N = M is called perfect when the equality is achieved in the Hamming bound:
    M = q^N / v_{N,q}(E).
If a code X_N is perfect, every word y^(N) ∈ H_{N,q} belongs to a (unique) ball B_{N,q}(x^(N), E) about a codeword. That is, we are always able to decode y^(N) by a codeword: this leads to the correct answer if the number of errors is ≤ E, and to a wrong answer if it is > E. But we never get stuck in the course of decoding.
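The Hamming bound and the perfection of particular parameter sets are easy to verify numerically; the short sketch below checks the [7, 4] Hamming parameters with E = 1 and the binary Golay parameters N = 23, M = 2¹², E = 3.

```python
from math import comb

def volume(N, q, E):
    """v_{N,q}(E): number of q-ary words within Hamming distance E of a fixed word."""
    return sum(comb(N, k) * (q - 1) ** k for k in range(E + 1))

def hamming_bound(N, q, E):
    """Largest M allowed by M * v_{N,q}(E) <= q**N."""
    return q ** N // volume(N, q, E)

# binary Hamming [7,4] code: M = 16 attains the bound with E = 1, hence perfect
print(volume(7, 2, 1), hamming_bound(7, 2, 1))      # 8, 16
# binary Golay parameters: N = 23, E = 3, M = 2**12
print(volume(23, 2, 3), hamming_bound(23, 2, 3))    # 2048, 4096
```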
The problem of finding perfect binary codes was solved about 20 years ago.
These codes exist only for
(a) E = 1: here N = 2^l − 1, M = 2^{2^l − 1 − l}, and these codes correspond to the so-called Hamming codes;
(b) E = 3: here N = 23, M = 2^{12}; they correspond to the so-called (binary) Golay code.
Both the Hamming and Golay codes are discussed below. The Golay code is
used (together with some modifications) in the US space programme: already in
the 1970s the quality of photographs encoded by this code and transmitted from
Mars and Venus was so excellent that it did not require any improving procedure.
In the former Soviet Union space vessels (and early American ones) other codes
were also used (and we also discuss them later): they generally produced lowerquality photographs, and further manipulations were required, based on statistics
of the pictorial images.
If we consider non-binary codes then there exists one more perfect code, for
three symbols (also named after Golay).
We will now describe a number of straightforward constructions producing new
codes from existing ones.
Example 2.1.8
(i) Extension: You add a digit xN+1 to each codeword x(N) = x1 . . . xN from
code X_N, following an agreed rule. Viz., the so-called parity-check extension requires that x_{N+1} + ∑_{1≤j≤N} x_j = 0 in the alphabet field F_q. Clearly, the
(N)
) : x(N) , x
(N)
XN ].
(v) Shortening: Take all codewords x^(N) ∈ X_N with the ith digit 0, say, and delete this digit (shortening on x_i = 0). In this way the original binary linear [N, M, d] code X_N is reduced to a binary linear code X^{sh,0}_{N−1}(i) of length N − 1, whose size can be M/2 or M and distance ≥ d or, in a trivial case, 0.
(vi) Repetition: Repeat each codeword x(= x^(N)) ∈ X_N a fixed number of times, say m, producing a concatenated (Nm)-word x x … x. The result is a code X^{re}_{Nm}, of length Nm and distance d(X^{re}_{Nm}) = m d(X_N).
(vii) Direct sum: Given two codes X_N and X′_{N′}, form a code X ⊕ X′ = {x x′ : x ∈ X, x′ ∈ X′}. Both the repetition and direct-sum constructions are not very effective and neither is particularly popular in coding (though we will return to these constructions in examples and problems). A more effective construction is
(viii) The bar-product (x|x + x′): For the [N, M, d] and [N, M′, d′] codes X_N and X′_N define a code X_N|X′_N of length 2N as the collection
    { x(x + x′) : x(= x^(N)) ∈ X_N, x′(= x′^(N)) ∈ X′_N }.
That is, each codeword in X|X′ is a concatenation of a codeword from X_N and its sum with a codeword from X′_N (formally, neither of X_N, X′_N in this construction is supposed to be linear). The resulting code is denoted by X_N|X′_N; it has size
    ♯(X_N|X′_N) = ♯X_N · ♯X′_N.
A useful exercise is to check that the distance
    d(X_N|X′_N) = min[ 2 d(X_N), d(X′_N) ].
(ix) The dual code: The concept of duality is based on the inner dot-product in the space H_{N,q} (with q = p^s): for x = x_1 … x_N and y = y_1 … y_N,
    ⟨ x^(N) · y^(N) ⟩ = x_1 y_1 + ⋯ + x_N y_N,
which yields a value from the field F_q. For a linear [N, k] code X_N its dual, X_N^⊥, is a linear [N, N − k] code defined by
    X_N^⊥ = { y^(N) ∈ H_{N,q} : ⟨ x^(N) · y^(N) ⟩ = 0 for all x^(N) ∈ X_N }.      (2.1.9)
Clearly, (X_N^⊥)^⊥ = X_N. Also dim X_N + dim X_N^⊥ = N. A code is called self-dual if X_N = X_N^⊥.
Worked Example 2.1.9
(a) Prove that if the distance d of an [N, M, d] code XN is an odd number then the
code may be extended to an [N + 1, M] code X + with distance d + 1.
(b) Show that an E-error correcting code X_N can be extended to a code X^+ that detects 2E + 1 errors.
(c) Show that the distance of a perfect binary code is an odd number.
Solution  (a) By adding the digit x_{N+1} to the codewords x = x_1 … x_N of an [N, M] code X_N so that x_{N+1} = ∑_{1≤j≤N} x_j, we obtain an [N + 1, M] code X^+. If the distance between two codewords was even it is unchanged; if it was odd it increases by 1, so d(X^+) = d + 1 when d is odd.
Next, a putative perfect binary code with N = 90, M = 2^{78} and E = 2 would fit the Hamming bound:
    v_{90,2}(2) = 1 + 90 + (90 · 89)/2 = 4096 = 2^{12}
and
    M v_{90,2}(2) = 2^{78} · 2^{12} = 2^{90} = 2^N.
However, such a code does not exist. Assume that it exists and that the zero word 0 = 0 … 0 is a codeword. The code must have d = 5. Consider the 88 words with
three non-zero digits, with 1 in the first two places:
1110 . . . 00 ,
1101 . . . 00 ,
... ,
110 . . . 01 .
(2.1.10)
Each of these words should be at distance ≤ 2 from a unique codeword. Say, the
codeword for 1110 … 00 must contain 5 non-zero digits. Assume that it is
111110 . . . 00.
This codeword is at distance 2 from two other subsequent words,
11010 . . . 00
and
11001 . . . 00 .
Continuing with this construction, we see that any word from list (2.1.10) is attracted to a codeword with 5 non-zero digits, along with two other words from
(2.1.10). But 88 is not divisible by 3.
Let us continue with bounds on codes.
Theorem 2.1.11  (The Gilbert–Varshamov (GV) bound)  For any q ≥ 2 and d ≥ 2, there exists a q-ary [N, M, d] code X_N such that
    M = ♯X_N ≥ q^N / v_{N,q}(d − 1).                                          (2.1.11)
Proof  Consider a code of maximal size among the codes of minimal distance ≥ d and length N. Then any word y^(N) ∈ H_{N,q} must be within distance d − 1 of some codeword: otherwise we could add y^(N) to the code without decreasing the minimal distance. Hence, the balls of radius d − 1 about the codewords cover the whole Hamming space H_{N,q}. That is, for the code of maximal size, X_N^max,
    ♯X_N^max · v_{N,q}(d − 1) ≥ q^N.
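The maximality argument above is constructive in spirit: greedily adding any word at distance ≥ d from all words chosen so far produces a code meeting the GV bound. A small illustration (the parameters N = 8, d = 3 are arbitrary):

```python
from itertools import product
from math import comb

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def greedy_code(N, d, q=2):
    """Greedily pick words pairwise at distance >= d; by maximality the GV bound holds."""
    code = []
    for w in product(range(q), repeat=N):
        if all(hamming(w, c) >= d for c in code):
            code.append(w)
    return code

N, d, q = 8, 3, 2
code = greedy_code(N, d, q)
v = sum(comb(N, k) * (q - 1) ** k for k in range(d))   # v_{N,q}(d-1)
print(len(code), ">=", q ** N / v)                     # GV guarantee q^N / v_{N,q}(d-1)
```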
As was listed before, there are ways of producing one code from another (or from a collection of codes). Let us apply truncation and drop the last digit x_N in each codeword x^(N) from an original code X_N. If code X_N had minimal distance d ≥ 2, the truncated code has distance ≥ d − 1 and the same size; iterating this d − 1 times produces a code of length N − d + 1 and size M. This proves the Singleton bound: any q-ary code X_N with minimal distance d has size
    M = ♯X_N ≤ M_q(N, d) ≤ q^{N−d+1}.                                         (2.1.12)
A code attaining equality in the Singleton bound,
    d = N − log_q M + 1,                                                      (2.1.13)
is called maximum distance separable (MDS). We will see below that, similarly to perfect codes, the family of the MDS codes is rather thin.

Corollary 2.1.14  For all q, N and d ≥ 2,
    q^N / v_{N,q}(d − 1) ≤ M_q(N, d) ≤ min [ q^N / v_{N,q}(⌊(d − 1)/2⌋), q^{N−d+1} ].     (2.1.14)
From now on we will omit indices N and (N) whenever it does not lead to confusion. The upper bound in (2.1.14) becomes too rough when d is comparable with N/2. Say, in the case of a binary [N, M, d] code with N = 10 and d = 5, expression (2.1.14) gives the upper bound M_2(10, 5) ≤ 18, whereas in fact there is no such code with M ≥ 13, but there exists a code with M = 12. The codewords of the latter are as follows:
    0000000000, 1111100000, 1001011010, 0100110110,
    1100001101, 0011010101, 0010011011, 1110010011,
    1001100111, 1010111100, 0111001110, 0101111001.
The lower bound gives in this case the value 2 (as 2^10/v_{10,2}(4) ≈ 2.65) and is also far from being satisfactory. (Some better bounds will be obtained below.)
Theorem 2.1.15  (The Plotkin bound)  For a binary code X of length N and distance d with N < 2d, the size M obeys
    M = ♯X ≤ 2 ⌊ d/(2d − N) ⌋.                                                (2.1.15)
Proof  The minimal distance cannot exceed the average distance, i.e.
    M(M − 1) d ≤ ∑_{x∈X} ∑_{x′∈X} δ(x, x′).
List the codewords as rows of an M × N matrix. If column i of this matrix contains s_i zeros and M − s_i ones then
    ∑_{x∈X} ∑_{x′∈X} δ(x, x′) = 2 ∑_{1≤i≤N} s_i (M − s_i).                     (2.1.16)
Since s_i(M − s_i) ≤ M²/4, this yields M(M − 1)d ≤ NM²/2, i.e. M(2d − N) ≤ 2d and, as
    2d/(2d − N) − N/(2d − N) = 1,
we get M ≤ 2d/(2d − N) ≤ 2⌊d/(2d − N)⌋ + 1; a parity argument then improves this to (2.1.15).

Theorem 2.1.16  For all positive integers d and N,
    M_2(N, 2d − 1) = M_2(N + 1, 2d)                                           (2.1.17)
and
    M_2(N, d) ≤ 2 M_2(N − 1, d).                                              (2.1.18)
Proof  To prove (2.1.17) let X be a code of length N, distance 2d − 1 and size M_2(N, 2d − 1). Take its parity-check extension X^+: that is, add the digit x_{N+1} to each codeword x = x_1 … x_N so that ∑_{i=1}^{N+1} x_i = 0. The extended code has length N + 1, the same size and distance 2d.
Turning to the proof of (2.1.18), given an [N, d] code, divide the codewords into two classes: those ending with 0 and those ending with 1. One class must contain at least half of the codewords. Hence the result.
Corollary 2.1.17
(2.1.19)
and
M2 (2d, d) 4d.
(2.1.20)
d +1
2
2d + 1 N
B
(2.1.21)
and
M2 (2d + 1, d) 4d + 4.
(2.1.22)
Proof Inequality (2.1.19) follows from (2.1.17), and (2.1.20) follows from
(2.1.18) and (2.1.19): if d = 2d then
M2 (4d , 2d ) = 2M2 (4d 1, 2d ) 8d = 4d.
Furthermore, (2.1.21) follows from (2.1.17):
M2 (N, d)
M2 (N + 1, d + 1)
B
d +1
2
.
2d + 1 N
Worked Example 2.1.18  Set θ = (q − 1)/q. Prove the q-ary Plotkin bound:
    M_q(N, d) ≤ d/(d − θN),   if d > θN.                                       (2.1.23)
Solution  Given a q-ary [N, M, d] code X_N, observe that the minimal distance d is bounded by the average distance:
    d ≤ S/(M(M − 1)),  where S = ∑_{x∈X} ∑_{x′∈X} δ(x, x′).
As before, let k_ij denote the number of letters j ∈ {0, …, q − 1} in the ith position in all codewords from X, i = 1, …, N. Then, clearly, ∑_{0≤j≤q−1} k_ij = M and the contribution of position i to S equals
    ∑_{0≤j≤q−1} k_ij (M − k_ij) = M² − ∑_{0≤j≤q−1} k_ij² ≤ M² − M²/q,
as the quadratic function (u_1, …, u_q) ↦ ∑_{1≤j≤q} u_j² achieves its minimum on the set {u = u_1 … u_q : u_j ≥ 0, ∑_j u_j = M} at u_1 = ⋯ = u_q = M/q. Summing over all N digits, we obtain, with θ = (q − 1)/q,
    M(M − 1) d ≤ M² θ N,
which yields the bound M ≤ d(d − θN)^{−1}. The proof is completed as in the binary case.
There exists a substantial theory related to the equality in the Plotkin bound
(Hadamard codes) but it will not be discussed in this book. We would also like
to point out the fact that all bounds established so far (Hamming, Singleton, GV
and Plotkin) hold for codes that are not necessarily linear. As far as the GV bound
is concerned, one can prove that it can be achieved by linear codes: see Theorem
2.3.26.
Worked Example 2.1.19 Prove that a 2-error correcting binary code of length
10 can have at most 12 codewords.
Solution The distance of the code must be 5. Suppose that it contains M codewords and extend it to an [11, M] code of distance 6. The Plotkin bound works as
follows. List all codewords of the extended code as rows of an M × 11 matrix. If column i in this matrix contains s_i zeros and M − s_i ones then
    6(M − 1)M ≤ ∑_{x∈X^+} ∑_{x′∈X^+} δ(x, x′) = 2 ∑_{i=1}^{11} s_i (M − s_i).          (2.1.24)
    lim_{N→∞} (1/N) log v_{N,2}(R_N) = −τ log τ − (1 − τ) log(1 − τ)  whenever R_N/N → τ ∈ (0, 1/2].     (2.1.25)
Solution  Since \binom{N}{R} ≤ v_{N,2}(R) ≤ (R + 1)\binom{N}{R} for R ≤ N/2, Stirling's formula gives
    −(R/N) log(R/N) − (1 − R/N) log(1 − R/N) − O(log N)/N ≤ (1/N) log v_{N,2}(R)
        ≤ (1/N) log(R + 1) + the LHS.
The limit R/N → τ yields the result. The case where R = Nτ is considered in a similar manner.
Worked Example 2.1.20 is useful in the study of the asymptotics of
    α(N, δ) = (1/N) log M_2(N, ⌊Nδ⌋),                                         (2.1.26)
the information rate for the maximum size of a code correcting ⌊Nδ/2⌋ and detecting ⌊Nδ⌋ errors (i.e. a linear portion of the total number N of digits). Set
    a(δ) := lim inf_{N→∞} α(N, δ) ≤ lim sup_{N→∞} α(N, δ) =: a̅(δ).           (2.1.27)
Theorem 2.1.21  With η(δ) = −δ log δ − (1 − δ) log(1 − δ),
    a̅(δ) ≤ 1 − η(δ/2),   0 ≤ δ ≤ 1/2   (Hamming),                             (2.1.28)
    a̅(δ) ≤ 1 − δ,          0 ≤ δ ≤ 1/2   (Singleton),                          (2.1.29)
    a(δ) ≥ 1 − η(δ),       0 ≤ δ ≤ 1/2   (GV),                                 (2.1.30)
    a̅(δ) = 0,              1/2 ≤ δ ≤ 1   (Plotkin).                            (2.1.31)
By using more elaborate bounds (also due to Plotkin), we'll show in Problem 2.10 that
    a̅(δ) ≤ 1 − 2δ,   0 ≤ δ ≤ 1/2.                                              (2.1.32)
The proof of Theorem 2.1.21 is based on a direct inspection of the abovementioned bounds; for Hamming and GV bounds it is carried in Worked Example
2.1.22 later.
Figure 2.2  Asymptotic bounds: Plotkin, Singleton, Hamming and Gilbert–Varshamov.
Figure 2.2 shows the behaviour of the bounds established. Good sequences of codes are those for which the pair (δ, α(N, δ)) is asymptotically confined to the domain between the curves indicating the asymptotic bounds. In particular, a good code should lie above the curve emerging from the GV bound. Constructing such sequences is a difficult problem: the first examples achieving the asymptotic GV bound appeared in 1973 (the Goppa codes, based on ideas from algebraic geometry). All families of codes discussed in this book produce values below the GV curve (in fact, they yield a(δ) = 0 in the limit), although these codes demonstrate quite impressive properties for particular values of N, M and d.
As to the upper bounds, the Hamming and Plotkin compete against each other,
while the Singleton bound turns out to be asymptotically insignificant (although
it is quite important for specific values of N, M and d). There are about a dozen
various other upper bounds, some of which will be discussed in this and subsequent
sections of the book.
The GilbertVarshamov bound itself is not necessarily optimal. Until 1982 there
was no better lower bound known (and in the case of binary coding there is still no
better lower bound known). However, if the alphabet used contains q 49 symbols
where q = p2m and p 7 is a prime number, there exists a construction, again based
on algebraic geometry, which produces a different lower bound and gives examples of (linear) codes that asymptotically exceed, as N , the GV curve [159].
Moreover, the TVZ construction carries a polynomial complexity. Subsequently,
two more lower bounds were proposed: (a) Elkies' bound, for q = p^{2m} + 1; and (b) Xing's bound, for q = p^m [43, 175]; see N. Elkies, 'Excellent codes from modular curves'. Manipulating with different coding constructions, the GV bound can also be improved for other alphabets.
Worked Example 2.1.22 Prove bounds (2.1.28) and (2.1.30) (that is, those parts
of Theorem 2.1.21 related to the asymptotical Hamming and GV bounds).
Solution  Picking up the Hamming and the GV parts in (2.1.14), we have
    2^N / v_{N,2}(d − 1) ≤ M_2(N, d) ≤ 2^N / v_{N,2}(⌊(d − 1)/2⌋).            (2.1.33)
The lower bound for the Hamming volume is trivial:
    v_{N,2}(⌊(d − 1)/2⌋) ≥ \binom{N}{⌊(d − 1)/2⌋}.
For the upper bound, observe that with d/N < 1/2,
d1i
d 1
N
i
N
d
+
1
0id1
d1i
N
1
N
.
d 1
1 2 d 1
0id1 1
vN,2 (d 1)
Then, for the information rate (log M2 (N, d)) N,
N
1
1
1 log
N
1 2 d 1
N
1
1
(2.1.34)
162
where
(2.1.35)
(q) (N, ) =
1
log Mq (N, N)
N
(2.1.36)
Theorem 2.1.24
(2.1.37)
(2.1.38)
a ( ) 1
(2.1.39)
(q)
(Singleton),
a(q) ( ) 1 (q) ( )
(GV),
a ( ) max[1 / , 0] (Plotkin).
(q)
(2.1.40)
(2.1.41)
Of course, the minimum of the right-hand sides of (2.1.38), (2.1.39) and (2.1.41)
provides the better of the three upper bounds. We omit the proof of Theorem 2.1.24,
leaving it as an exercise that is a repetition of the argument from Worked Example
2.1.22.
Example 2.1.25 Prove bounds (2.1.38) and (2.1.40), by modifying the solution
to Worked Example 2.1.22.
The capacity of the MBSC equals
    C = 1 + p log p + (1 − p) log(1 − p),                                      (2.2.1)
the channel matrix being
    Π = (  1−p   p  )
        (   p   1−p ).                                                         (2.2.2)
That is, we assume that the channel transmits a letter correctly with probability 1 − p and reverses it with probability p, independently for different letters.
In Theorem 2.2.1, it is asserted that there exists a sequence of one-to-one coding maps fn , for which the task of decoding is reduced to guessing the codewords
fn (u) HN . In other words, the theorem guarantees that for all R < C there exists a
sequence of subsets XN HN with XN 2NR for which the probability of incorrect guessing tends to 0, and the exact nature of the coding map fn is not important.
Nevertheless, it is convenient to keep the map fn firmly in sight, as the existence
will follow from a probabilistic construction (random coding) where sample coding maps are not necessarily one-to-one. Also, the decoding rule is geometric: upon
receiving a word a(N) HN , we look for the nearest codeword fn (u) XN . Consequently, an error is declared every time such a codeword is not unique or is a result
of multiple encodings or simply yields a wrong message. As we saw earlier, the
geometric decoding rule corresponds with the ML decoder when the probability
p (0, 1/2). Such a decoder enables us to use geometric arguments constituting
the core of the proof.
Again as in Section 1.4, the new proof of the direct part of the SCT/NCT only
guarantees the existence of good codes (and even their proliferation) but gives
no clue on how to construct such codes [apart from running again a random coding
scheme and picking its typical realisation].
In the statement of the SCT/NCT given below, we deal with the maximum errorprobability (2.2.4) rather than the averaged one over possible messages. However,
a large part of the proof is still based on a direct analysis of the error-probabilities
averaged over the codewords.
Theorem 2.2.1 (The SCT/NCT, the direct part) Consider an MBSC with channel
matrix as in (2.2.2), with 0 p < 1/2, and let C be as in (2.2.1). Then for any
164
(2.2.3)
Example 2.2.2
Then we can embed the blocks from A L in the alphabet A n and so encode the
blocks. The transmission rate is log |A L | n/R 0.577291. As was mentioned,
the SCT tells us that there are such codes but gives no idea of how to find (or
construct) them, which is difficult to do.
Before we embark on the proof of Theorem 2.2.1, we would like to explore
connections between the geometry of Hamming space HN and the
randomness
generated by the channel. As in Section 1.4, we use the symbol P | fn (u) as a
shorthand for Pch | fn (u) sent . The
expectation and
variance under this distribution will be denoted by E( | fn (u) and Var( | fn (u) .
165
Observe that, under distribution P | fn (u) , the number of distorted digits in
the (random) received word Y(N) can be written as
N
j=1
This is a random variable which has a binomial distribution Bin(N, p), with the
mean value
N
(N)
E 1(digit j in Y = digit j in fn (u)) fn (u)
j=1
N
= E 1(digit j in Y(N) = digit j in fn (u))| fn (u) = N p,
j=1
Var
1(digit j in
j=1
N
Y(N)
digit j in fn (u)) fn (u)
=
= Var 1(digit j in Y(N) = digit j in fn (u))| fn (u) = N p(1 p).
j=1
Then, by Chebyshev's inequality, for any given ε ∈ (0, 1 − p) and positive integer N > 1/ε, the probability that at least N(p + ε) − 1 digits have been distorted, given that the codeword f_n(u) has been sent, satisfies
    P( ≥ N(p + ε) − 1 digits distorted | f_n(u) ) ≤ p(1 − p) / ( N(ε − 1/N)² ).         (2.2.5)
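The estimate (2.2.5) is nothing but Chebyshev's inequality for a Binomial(N, p) number of distorted digits; a quick numerical comparison with the exact tail (the parameters below are illustrative):

```python
from math import comb

def binom_tail(N, p, k):
    """P(at least k of N digits distorted) for independent flips with probability p."""
    return sum(comb(N, j) * p ** j * (1 - p) ** (N - j) for j in range(k, N + 1))

N, p, eps = 500, 0.1, 0.05
k = int(N * (p + eps)) - 1                            # threshold used in (2.2.5)
chebyshev = p * (1 - p) / (N * (eps - 1 / N) ** 2)
print(f"exact tail = {binom_tail(N, p, k):.3e} <= Chebyshev bound = {chebyshev:.3e}")
```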
Proofof Theorem 2.2.1. Throughout the proof, we follow the set-up from (2.2.3).
Subscripts n and N will be often omitted; viz., we set
2n = M.
We will assume the ML/geometric decoder without any further mention. Similarly
to Section 1.4, we identify the set of source messages Un with Hamming space Hn .
As proposed by Shannon, we use again a random encoding. More precisely, a message u Hn is mapped to a random codeword Fn (u) HN , with IID digits taking
values 0 and 1 with probability 1/2 and independently of each other. In addition,
we make codewords Fn (u) independent for different messages u Hn ; labelling
the strings from Hn by u(1), . . . , u(M) (in no particular order) we obtain a family of IID random strings Fn (u(1)), . . . , Fn (u(M)) from HN . Finally, we make the
codewords independent of the channel. Again, in analogy with Section 1.4, we can
think of the random code under consideration as a random megastring/codebook
from HNM = {0, 1}NM with IID digits 0, 1 of equal probability. Every given sample
f (= fn ) of this random codebook (i.e. any given megastring from HNM ) specifies
166
f
f(u(1))
f( u(2))
f (u(r))
Figure 2.3
1
2NM
(2.2.6)
eave (Fn ),
M 1iM
Then the expected average error-probability is given by
1
eave f .
E n eave (Fn ) = NM
2
f (u(1)),..., f (u(M))H
(2.2.9)
Relation (2.2.9) implies (again in a manner similar to Section 1.4) that there
exists a sequence of deterministic codes fn such that the average error-probability
eave ( fn ) = eave ( fn (u(1)), . . . , fn (u(2n ))) obeys
lim eave ( fn ) = 0.
(2.2.10)
167
fn (u(i))
decoding
fn (u(i)) sent
N
...
..
a
a received
(a, m)
Figure 2.4
4
p(1 p)
M 1 3
+
v
)
,
N(p
+
N
N( 1/N)2
2N
(2.2.12)
where vN (b) stands for the number of points in the ball of radius b in the binary
Hamming space HN .
Proof Set m(= mN (p, )) := N(p + ). The ML decoder definitely returns the
codeword fn (u(i)) sent through the channel when fn (u(i)) is the only codeword in
168
the Hamming ball BN (y, m) around the received word y = y(N) HN (see
Figure 2.4). In any other situation (when fn (u(i)) BN (y, m) or fn (u(k))
BN (y, m) for some k = i) there is a possibility of error.
Hence,
P error while using codebook f | fn (u(i))
P y| fn (u(i)) 1 fn (u(i)) BN (y, m)
(2.2.13)
yHN
+ P(z| fn (u(i))) 1 fn (u(k)) BN (z, m) .
zHN
k=i
(2.2.14)
by virtue of (2.2.5). Observe that since the RHS in (2.2.14) does not depend on the
choice of the sample code f , the bound (2.2.14) will hold when we take first the
1
average
and then expectation E n .
M 1iM
The second sum in the RHS of (2.2.13) is more tricky: it requires averaging and
taking expectation. Here we have
En
over HN , the expectations
En P(z|Fn (u(i))) and En 1 Fn (u(k)) BN (z, m) can be calculated as
1
En P(z|Fn (u(i))) = N
2
and
P(z|x)
(2.2.16a)
xHN
vN (m)
.
En 1 Fn (u(k)) BN (z, m) =
2N
(2.2.16b)
169
zHN xHN
P(z|x) =
P(z|x) = 2N .
(2.2.17)
xHN zHN
M 1iM k=i 2N
(2.2.18)
vN (m)M(M 1) (M 1)vN (m)
=
=
.
2N M
2N
Collecting (2.2.12)(2.2.18) we have that EN [eave (Fn )] does not exceed the RHS
of (2.2.12).
the RHS of (2.2.15) =
At the next stage we estimate the volume vN (m) in terms of entropy h(p + )
where, recall, m = N(p + ). The argument here is close to that from Section 1.4
and based on the following result.
Lemma 2.2.5 Suppose that 0 < p < 1/2, > 0 and positive integer N satisfy
p + + 1/N < 1/2. Then the following bound holds true:
vN (N(p + )) 2N (p+ ) .
(2.2.19)
The proof of Lemma 2.2.5 will be given later, after Worked Example 2.2.7. For
the moment we proceed with the proof of Theorem 2.2.1. Recall, we want to establish (2.2.7). In fact, if p < 1/2 and R < C = 1 (p) then we set = C R > 0
and take > 0 so small that (i) p + < 1/2 and (ii) R + /2 < 1 (p + ). Then
we take N so large that (iii) N > 2/ . With this choice of and N, we have
1
> and R 1 + (p + ) < .
N
2
2
Then, starting with (2.2.12), we can write
(2.2.20)
4p(1 p) 2NR N (p+ )
EN e(Fn )
+ N 2
N 2
2
(2.2.21)
4
N /2
<
p(1
p)
+
2
.
N 2
This implies (2.2.7) and hence the existence of a sequence of codes fn : Hn HN
obeying (2.2.10).
To finish the proof of Theorem 2.2.1, we deduce (2.2.4) from (2.2.7), in the form
of Lemma 2.2.6:
170
(i) For all R (0,C), there exist codes fn with lim emax ( fn ) = 0.
n
(ii) For all R (0,C), there exist codes fn such that lim eave ( fn ) = 0.
n
Proof of Lemma 2.2.6. It is clear that assertion (i) implies (ii). To deduce (i) from
(ii), take R < C and set for N big enough
R = R +
1
< C, n = NR , M = 2n .
N
(2.2.22)
Here and below, M = 2NR and fn (u(1)), . . . , fn (u(M )) are the codewords for
source messages u(1), . . . , u(M ) Hn .
Instead of P error while using fn | fn (u(i)) , we write P fn -error| fn (u(i)) ,
for brevity. Now, at least half of summands P fn -error| fn (u(i)) in the RHS of
(2.2.23) must be < 2eave ( fn ). Observe that, in view of (2.2.22),
M /2 2NR1 .
(2.2.24)
Solution Write
*
+
vN (m) = points at distance m from 0 in HN =
171
N
k .
0km
<
, for 0 k < m.
1
1
Then, for 0 k < m, the product
k
=
(1 )N
1
m
>
(1 )N = m (1 )Nm .
1
k (1 )Nk
Hence,
N
N
k (1 )Nk >
k (1 )Nk
k
k
0kN
0km
N
m
Nm
= vN (m) m (1 )Nm
> (1 )
0km k
1 =
mmk
(N m + 1) (N k)
=
172
(m + 1) k
(N m)km
which is again 1 as the product of 2(k m) factors 1. Thus, the ratio pk /pm
1, and the desired bound follows.
We are now in position to prove Lemma 2.2.5.
Proof of Lemma 2.2.5 First, p + < 1/2 implies that m = N(p + ) < N/2 and
:=
m N(p + )
=
< p + ,
N
N
In other words, the noise acts on each symbol xi of the input string x independently,
and P(y|x) is the probability of having an output symbol y given that the input
symbol is x.
173
Symbol x runs over Ain , an input alphabet of a given size q, and y belongs to
Aout , an output alphabet of size r. Then probabilities P(y|x) form a q r stochastic
matrix (the channel matrix). A memoryless channel is called symmetric if the rows
of this matrix are permutations of each other, i.e. contain the same collection of
probabilities, say p1 , . . . , pr . A memoryless symmetric channel is said to be doublesymmetric if the columns of the channel matrix are also permutations of each other.
If m = n = 2 (typically, Ain = Aout = {0, 1}) a memoryless channel is called binary.
For a memoryless binary symmetric channel, the channel matrix entries P(y|x)
are P(0|0) = P(1|1) = 1 p, P(1|0) = P(0|1) = p, p (0, 1) being the flipping
probability and 1 p the probability of flawless transmission of a single binary
symbol.
A channel is characterised by its capacity: the value C 0 such that:
(a) for all R < C, R is a reliable transmission rate; and
(b) for all R > C, R is an unreliable transmission rate.
Here R is called a reliable transmission rate if there exists a sequence of codes
fn : Hn HN and decoding rules fN : HN Hn such that n NR and the (suitably defined) probability of error
e( fn , fN ) 0, as N .
In other words,
C = lim
1
log MN
N
where I(X : Y ) is the mutual information between a (random) input symbol X and
the corresponding output symbol Y , and the maximum is over all possible probability distributions pX of X.
Now in the case of a memoryless symmetric channel (MSC), the above maximisation procedure applies to the output symbols only:
C = max h(Y ) + pi log pi ;
pX
1ir
the sum pi log pi being the entropy of the row of channel matrix (P(y|x)). For
i
174
n(1,t 1)
n(1,t)
,
=A
n(1,t 2)
n(1,t 1)
1 1
A=
.
1 0
1
5+1
.
= lim log n(1,t) + n(0,t) = log
t t
2
C = lim
175
Remark 2.2.9
obeys
lim sup max ( fn , fN ) = 1.
(2.2.26b)
Proof As in Section 1.4, we can assume that codes fn are one-to-one and obey
fN ( fn (u)) = u, for all u Hn (otherwise, the chances of erroneous decoding will
be even larger). Assume the opposite of (2.2.26b):
(2.2.27)
Our aim is to deduce from (2.2.27) that R C. As before, set n = NR and let
fn (u(i)) be the codeword for string u(i) Hn , i = 1, . . . , 2n . Let Di HN be the set
of binary strings where fN returns fn (u(i)): fN (a) = fn (u(i)) if and only if a Di .
Then Di ! f (u(i)), sets Di are pairwise disjoint, and if the union i Di = HN then
on the complement HN \ i Di the channel declares an error. Set si = Di , the size
of set Di .
Our first step is to improve the decoding rule, by making it closer to the ML
rule. In other words, we want to replace each Di with a new set, Ci HN , of the
same cardinality Ci = si , but of a more rounded shape (i.e. closer to a Hamming
176
ball B( f (u(i)), bi )). That is, we look for pairwise disjoint sets Ci , of cardinalities
Ci = si , satisfying
BN ( f (u(i)), bi ) Ci BN ( f (u(i)), bi + 1), 1 i 2n ,
(2.2.28)
(2.2.30)
Then, clearly,
(2.2.31)
Next, suppose that there exists p < p such that, for any N large enough,
bi + 1 N p for some 1 i 2n .
(2.2.32)
Then, by virtue of (2.2.28) and (2.2.31), with Cic standing for the complement
HN \ Ci ,
P(at least N p digits distorted| fn (u(i)))
P(at least bi + 1 digits distorted| fn (u(i)))
P(Cic | fn (u(i))) max ( fn , gN ) c.
This would lead to a contradiction, since, by the law of large numbers, as N ,
the probability
P(at least N p digits distorted | x sent) 1
uniformly in the choice of the input word x HN . (In fact, this probability does
not depend on x HN .)
Thus, we cannot have p (0, p) such that, for N large enough, (2.2.32) holds
true. That is, the opposite is true: for any given p (0, p), we can find an arbitrarily
large N such that
bi > N p ,
for all i = 1, . . . , 2n .
(2.2.33)
177
(As we claim (2.2.33) for all p (0, p), it does not matter if in the LHS of (2.2.33)
we put bi or bi + 1.)
At this stage we again use the explicit expression for the volume of the Hamming
ball:
N
N
si = Di = Ci vN (bi ) =
j
bi
0 jbi
N
, provided that bi > N p .
(2.2.34)
N p
A useful bound has been provided in Worked Example 2.2.7 (see (2.2.25)):
R
1
2N N .
(2.2.35)
vN (R)
N +1
We are now in a position to finish the proof of Theorem 2.2.10. In view of
(2.2.35), we have that, for all p (0, p), we can find an arbitrarily large N such
that
si 2N( (p )N ) ,
for all 1 i 2n ,
with lim N = 0. As the original sets D1 , . . . , D2n are disjoint, we have that
N
NR
1
1, implying that R 1 (p ) + N + .
N
N
As N , the RHS tends to 1 (p). So, given any p (0, p), R 1 (p ).
This is true for all p < p, hence R 1 (p) = C. This completes the proof of
Theorem 2.2.10.
(p ) N +
yHN
Ky = MvN,q (s)
(2.2.36)
178
as each word x falls in vN,q (s) of the balls BN,q (y, s).
Lemma 2.2.11 If X is a q-ary [N, M] code then for all s = 1, . . . , N there
exists a ball BN,q (y, s) about an N -word y HN,q with the number Ky = X
BN,q (y, s) of codewords in BN,q (y, s) obeying
Ky MvN,q (s)/qN .
(2.2.37)
1
Ky gives the average numqN y
ber of codewords in ball BN,q (y, s). But there must be a ball containing at least as
many as the average number of codewords.
Proof Divide both sides of (2.2.36) by qN . Then
A ball BN,q (y, s) with property (2.2.37) is called critical (for code X ).
Theorem 2.2.12 (The Elias bound) Set = (q 1)/q. Then for all integers s 1
such that s < N and s2 2 Ns + Nd > 0, the maximum size Mq (N, d) of a q-ary
code of length N and distance d satisfies
Mq (N, d)
qN
Nd
.
s2 2 Ns + Nd vN,q (s)
(2.2.38)
Proof Fix a critical ball BN,q (y, s) and consider code X obtained by subtracting
word y from the codewords of X : X = {x y : x X }. Then X is again an
[N, M, d] code. So, we can assume that y = 0 and BN,q (0, s) is a critical ball.
Then take X1 = X BN,q (0, s) = {x X : w(x) s}. The code X1 is [N, K, e]
where e d and K (= K0 ) MvN,q (s)/qN . As in the proof of the Plotkin bound,
consider the sum of the distances between the codewords in X1 :
S1 =
xX1 x X1
(x, x ).
Again, we have that S1 K(K 1)e. On the other hand, if ki j is the number of
letters j Jq = {0, . . . , q 1} in the ith position in all codewords x X1 then
S1 =
ki j (K ki j ).
1iN 0 jq1
0 jq1
S = NK
2
1iN
2
+
ki0
1 jq1
ki2j
179
S
NK 2
1iN
2
ki j
1 jq1
1
(K ki0 )2 .
q1
1
2
(K ki0 )
q1
2 + K 2 2Kk + k2
(q 1)ki0
i0
i0
2 +
ki0
1
q 1 1iN
1
= NK 2
(qk2 + K 2 2Kki0 )
q 1 1iN i0
N
q
2 + 2 K
K2
= NK 2
ki0
ki0
q1
q 1 1iN
q 1 1iN
q2
q
2 + 2 KL,
NK 2
=
ki0
q1
q 1 1iN
q1
= NK 2
1iN
2
ki0
2
ki0
1iN
1 2
L .
N
Then
q2
q 1 2
2
NK 2
L +
KL
q1
q1 N
q1
1
q
=
(q 2)NK 2 L2 + 2KL .
q1
N
qs
1
K 2 s 2(q 1)
which can be
q1
N
Ne
s2 2 Ns + Ne
180
provided that s < N and s2 2 Ns + Ne > 0. Finally, recall that X (1) arose
from an [N, M, d] code X , with K Mv(s)/qN and e d. As a result, we obtain
that
MvN,q (s)
Nd
.
2
qN
s 2 Ns + Nd
This leads to the Elias bound (2.2.38).
The ideas used in the proof of the Elias bound (and earlier in the proof of the
Plotkin bounds) are also helpful in obtaining bounds for W2 (N, d, ), the maximal
size of a binary (non-linear) code X HN,2 of length N, distance d(X ) d and
with the property that the weight w(x) , x X . First, three obvious statements:
A B
N
W2 (N, 2k, )
B
kN
.
2 N + kN
N
and
2
(2.2.39)
Solution Take an [N, M, 2k] code X such that w(x) , x X . As before, let
ki1 be the number of 1s in
position
i in all codewords. Consider the sum of the
dot-products D = 1 x = x "x x #. We have
x,x X
1
"x x # = w(x x ) = w(x) + w(x ) x, x )
2
1
(2 2k) = k
2
and hence
D ( k)M(M 1).
On the other hand, the contribution to D from position i equals ki1 (ki1 1), i.e.
D=
1iN
ki1 (ki1 1) =
1iN
2
(ki1
ki1 ) =
1iN
2
ki1
M.
181
W2 (N, 2k, )
B
N
W (N 1, 2k, 1) .
2
N
and
2
(2.2.40)
Solution Again take an [N, M, 2k] code X such that w(x) for all x X .
Consider the shortening code X on x1 = 1 (cf. Example 2.1.8(v)): it gives a code
of length (N 1), distance 2k and constant weight ( 1). Hence, the size of
the cross-section is W2 (N 1, 2k, 1). Therefore, the number of 1s at position
1 in the codewords of X does not exceed W2 (N 1, 2k, 1). Repeating this
argument, we obtain that the total number of 1s in all positions is NW2 (N
1, 2k, 1). But this number equals M, i.e. M NW2 (N 1, 2k, 1). The
bound (2.2.40) then follows.
Corollary 2.2.15
N
and 2k 4k 2,
2
1
k
(2.2.41)
The remaining part of Section 2.2 focuses on the Johnson bound. This bound
aims at improving the binary Hamming bound (cf. (2.1.8b) with q 2):
(2.2.42)
M2 (N, 2E + 1) 2N vN (E) or vN (E) M2 (N, 2E + 1) 2N .
Namely, the Johnson bound asserts that
M2 (N, 2E + 1) 2N /vN (E) or vN (E) M2 (N, 2E + 1) 2N ,
where
vN (E)
N
E +1
2E + 1
.
W2 (N, 2E + 1, 2E + 1)
E
1
= vN (E) +
N/(E + 1)
(2.2.43)
(2.2.44)
182
N
Recall that vN (E) =
stands for the volume of the binary Hamming ball
0sE s
of radius E. We begin our derivation of bound (2.2.43) with the following result.
Lemma
2.2.16 If x, y are binary words, with (x, y) = 2 + 1, then there exists
2 + 1
binary words z such that (x, z) = + 1 and (y, z) = .
Proof Left as an exercise.
Consider the set T (= TN,E+1 ) of all binary N-words at distance exactly E + 1
from the codewords from X :
(
T = z HN : (z, x) = E + 1 for some x X
)
and (z, y) E + 1 for all y X .
(2.2.45)
Then we can write that
MvN (E) + T 2N ,
(2.2.46)
as none of the words z T falls in any of the balls of radius E about the codewords
y X . The bound (2.2.43) will follow when we solve the next worked example.
Worked Example 2.2.17 Prove that the cardinality T is greater than or equal
to the second term from the RHS of (2.2.44):
2E + 1
N
M
.
(2.2.47)
W2 (N, 2E + 1, 2E + 1)
E
N/(E + 1) E + 1
Solution We want to find a lower bound on T . Consider the set W (= WN,E+1 ))
of matched pairs of N-words defined by
)
(
W = (x, z) : x X , z TE+1 , (x, z) = E + 1
(
(2.2.48)
= (x, z) : x X , z HN : (x, z) = E + 1,
)
and (y, z) E + 1 for all y X .
Given x X , the x-section W x is defined as
W x = {z HN : (x, z) W }
= {z : (x, z) = E + 1, (y, z) E + 1 for all y X }.
(2.2.49)
for all y X }.
(2.2.50)
183
y X : (x, y) = 2E + 1 .
W =
E +1
E
Moreover, if we subtract x from every y X with (x, y) = 2E + 1, the result
is a code of length N whose codewords z have weight w(z) 2E + 1. Hence,
there are at most W (N, 2E + 1, 2E + 1) codewords y X with (x, y) = 2E + 1.
Consequently,
N
2E + 1
x
W (N, 2E + 1, 2E + 1)
(2.2.51)
W
E +1
E
and
W M the RHS of (2.2.51).
(2.2.52)
(2.2.53)
F
N
So, y v and z v have no digit 1 in common. Hence, there exist at most
E +1
E
F
N
words of the form y v where y W v , i.e. at most
words in W v . ThereE +1
fore,
E
F
N
W
T .
(2.2.54)
E +1
Collecting (2.2.51), (2.2.52) and (2.2.54) yields inequality (2.2.47).
184
N
M (N, 2E + 1) 2 vN (E)
A
B 1
(2.2.55)
N
N E
N E
1
N/(E + 1) E
E +1
E +1
Corollary 2.2.18  For the binary case N = 13, E = 2 the Johnson bound gives
    M_2(13, 5) ≤ 2^{13} / [ 1 + 13 + 78 + (286 − 10 · 23)/4 ] = 8192/106,  i.e.  M_2(13, 5) ≤ 77.
This bound is much better than Hamming's, which gives M_2(13, 5) ≤ 89. In fact, it is known that M_2(13, 5) = 64. Compare Section 3.4.
Any binary linear code of rank k contains 2^k vectors, i.e. has size 2^k.
Proof A basis of the code contains k linearly independent vectors. The code is
generated by the basis; hence it consists of the sums of basic vectors. There are
precisely 2k sums (the number of subsets of {1, . . . , k} indicating the summands),
and they all give different vectors.
Consequently, a binary linear code X of rank k may be used for encoding all
possible source strings of length k; the information rate of a binary linear [N, k]
code is k/N. Thus, indicating k N linearly independent words x HN identifies
a (unique) linear code X HN of rank k. In other words, a linear binary code of
rank k is characterised by a k N matrix of 0s and 1s with linearly independent
rows:
    G = ( g_11  g_12  …  g_1N )
        ( g_21  g_22  …  g_2N )
        (  ⋮     ⋮         ⋮  )
        ( g_k1  g_k2  …  g_kN ).
Namely, we take the rows g(i) = gi1 . . . giN , 1 i k, as the basic vectors of a
linear code.
Definition 2.3.3 A matrix G is called a generating matrix of a linear code. It is
clear that the generating matrix is not unique.
Equivalently, a linear [N, k] code X may be described as the kernel of a certain (N − k) × N matrix H, again with the entries 0 and 1: X = ker H, where
    H = ( h_11        h_12        …  h_1N     )
        ( h_21        h_22        …  h_2N     )
        (  ⋮           ⋮               ⋮      )
        ( h_(N−k)1    h_(N−k)2    …  h_(N−k)N )
and
    ker H = { x = x_1 … x_N : x H^T = 0^(N−k) }.                              (2.3.1)
Here, for x, y ∈ H_N, the dot-product is
    ⟨ x · y ⟩ = ∑_{1≤i≤N} x_i y_i  (addition mod 2).                          (2.3.2)
The inner product (2.3.2) possesses all properties of the Euclidean scalar product
in RN , but one: it is not positive definite (and therefore does not define a norm). That
is, there are non-zero vectors x HN with "x x# = 0. Luckily, we do not need the
positive definiteness.
However, the key ranknullity property holds true for the dot-product: if L is
a linear subspace in HN of rank k then its orthogonal complement L (i.e. the
collection of vectors z HN such that "x z# = 0 for all x L ) is a linear subspace
of rank N k. Thus, the (N k) rows of H can be considered as a basis in X ,
the orthogonal complement to X .
The matrix H (or sometimes its transpose H T ) with the property X = ker H or
"x h( j)# 0 (cf. (2.3.1)) is called a parity-check (or, simply, check) matrix of code
X . In many cases, the description of a code by a check matrix is more convenient
than by a generating one.
The parity-check matrix is again not unique as the basis in X can be chosen
non-uniquely. In addition, in some situations where a family of codes is considered, of varying length N, it is more natural to identify a check matrix where the
number of rows can be greater than N k (but some of these rows will be linearly
dependent); such examples appear in Chapter 3. However, for the time being we
will think of H as an (N k) N matrix with linearly independent rows.
Worked Example 2.3.4 Let X be a binary linear [N, k, d] code of information
rate = k/N . Let G and H be, respectively, the generating and parity-check matrices of X . In this example we refer to constructions introduced in Example 2.1.8.
(a) The parity-check extension of X is a binary code X + of length N +1 obtained
by adding, to each codeword x X , the symbol xN+1 = xi so that the
1iN
sum
1iN+1
xi is zero. Prove that X + is a linear code and find its rank and
minimal distance. How are the information rates and generating and paritycheck matrices of X and X + related?
(b) The truncation X of X is defined as a linear code of length N 1 obtained
by omitting the last symbol of each codeword x X . Suppose that code X
has distance d 2. Prove that X is linear and find its rank and generating
and parity-check matrices. Show that the minimal distance of X is at least
d 1.
(c) The m-repetition of X is a code X re (m) of length Nm obtained by repeating each codeword x X a total of m times. Prove that X re (m) is a linear
code and find its rank and minimal distance. How are the information rates and
generating and parity-check matrices of X re (m) related to , G and H ?
|
g1i
1iN
..
H
+
|
.
G =
G
,H =
..
|
.
|
gki
1iN
0 ... 0
187
|
|
|
|
|
|
|
1
1
The rank of X + equals the rank of X = k. If the minimal distance of X was even
it is not changed; if odd it increases by 1. The information rate + = (N 1) N.
(b) The generating matrix
g11 . . . g1N1
..
..
G =
.
..
.
gk1 . . . gkN1
The parity-check matrix H of X , after suitable column operations, may be written
as
H =
|
.
|
0 ... 0
|
The parity-check matrix of X is then identified with H . The rank is unchanged;
the distance may decrease maximum by 1. The information rate = N (N 1).
(c) The generating and parity-check matrices are
Gre (m) = (G . . . G) (m times),
and
H (m) =
re
0
0
I
..
.
...
...
...
..
.
H
I
I
..
.
0
I
0
..
.
0
0
0
..
.
0 0 ... I
188
Here, I is a unit N N matrix and the zeros mean the zero matrices (of size (N
k) N and N N, accordingly). The number of the unit matrices in the first column
equals m 1. (This is not a unique form of H re (m).) The size of H re (m) is (Nm
k) Nm.
The rank is unchanged, the minimal distance in X re (m) is md and the information rate /m.
Worked Example 2.3.5 A dual code of a linear binary [N, k] code X is defined
as the set X of the words y = y1 . . . yN such that the dot-product
"y x# =
yi xi = 0
1iN
189
Show that the subset of a linear binary code consisting of all words of even
weight is a linear code. Prove that, for d even, if there exists a linear [N, k, d] code
then there exists a linear [N, k, d] code with codewords of even weight.
Solution  The size is 2^k, and the number of different bases equals (1/k!) ∏_{i=0}^{k−1} (2^k − 2^i). Indeed, if the first l basis vectors are selected, all their 2^l linear combinations should be excluded on the next step. This gives 840 for k = 4, and 28 for k = 3.
Finally, for d even, we can truncate the original code and then use the parity-check extension.
Example 2.3.7  The binary Hamming [7, 4] code is determined by a 3 × 7 parity-check matrix. The columns of the check matrix are all non-zero words of length 3. Using lexicographic order of these words we obtain
    H^Ham_lex = ( 1 0 1 0 1 0 1 )
                ( 0 1 1 0 0 1 1 )
                ( 0 0 0 1 1 1 1 ).
The corresponding generating matrix may be written as
    G^Ham_lex = ( 0 0 1 1 0 0 1 )
                ( 0 1 0 0 1 0 1 )
                ( 0 0 1 0 1 1 0 )
                ( 1 1 1 0 0 0 0 ).                                            (2.3.3)
In many cases it is convenient to write the check matrix of a linear [N, k] code in a canonical (or standard) form:
    H^can = [ I_{N−k} | H′ ].                                                 (2.3.4a)
In the case of the Hamming [7, 4] code it gives
    H^Ham_can = ( 1 0 0 1 1 0 1 )
                ( 0 1 0 1 0 1 1 )
                ( 0 0 1 0 1 1 1 ),
with a generating matrix also in a canonical form:
    G^can = [ G′ | I_k ];                                                      (2.3.4b)
namely,
    G^Ham_can = ( 1 1 0 1 0 0 0 )
                ( 1 0 1 0 1 0 0 )
                ( 0 1 1 0 0 1 0 )
                ( 1 1 1 0 0 0 1 ).
190
Formally, Glex and Gcan determine different codes. However, these codes are
equivalent:
Definition 2.3.8 Two codes are called equivalent if they differ only in permutation of digits. For linear codes, equivalence means that their generating matrices
can be transformed into each other by permutation of columns and by rowoperations including addition of columns multiplied by scalars. It is plain that
equivalent codes have the same parameters (length, rank, distance).
In what follows, unless otherwise stated, we do not distinguish between equivalent linear codes.
Remark 2.3.9 An advantage of writing G in a canonical form is that a source
string u(k) Hk is encoded as an N-vector u(k) Gcan ; according to (2.3.4b), the last k
digits in u(k) Gcan form word u(k) (they are called information digits), whereas the
first N k are used for the parity-check (and called parity-check digits). Pictorially,
the parity-check digits carry the redundancy that allows the decoder to detect and
correct errors.
(2.3.5)
Theorem 2.3.11
(i) The distance of a linear binary code equals the minimal weight of its non-zero
codewords.
(ii) The distance of a linear binary code equals the minimal number of linearly
dependent columns in the check matrix.
Proof (i) As the code X is linear, the sum x + y X for each pair of codewords
x, y X . Owing to the shift invariance of the Hamming distance (see Lemma
2.1.1), (x, y) = (0, x + y) = w(x + y) for any pair of codewords. Hence, the
minimal distance of X equals the minimal distance between 0 and the rest of the
code, i.e. the minimal weight of a non-zero codeword from X .
(ii) Let X be a linear code with a parity-check matrix H and minimal distance d.
Then there exists a codeword x X with exactly d non-zero digits. Since xH T = 0,
191
we conclude that there are d columns of H which are linearly dependent (they
correspond to non-zero digits in x). On the other hand, if there exist (d 1) columns
of H which are linearly dependent then their sum is zero. But that means that there
exists a word y, of weight w(y) = d 1, such that yH T = 0. Then y must belong to
X which is impossible, since min[w(x) : x X , x = 0] = d.
Theorem 2.3.12 The Hamming [7, 4] code has minimal distance 3, i.e. it detects
2 errors and corrects 1. Moreover, it is a perfect code correcting a single error.
Proof  For any pair of columns, the parity-check matrix H^Ham_lex contains their sum as another column, so there exist linearly dependent triples of columns (viz. columns 1, 6, 7). No two columns are linearly dependent because they are distinct (x + y = 0 means x = y). Also, the volume v_7(1) equals 1 + 7 = 2³, and the code is perfect as its size is 2⁴ and 2⁴ · 2³ = 2⁷.
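To see the single-error correction at work, here is a short sketch (added for illustration) using H^Ham_lex and G^Ham_lex from Example 2.3.7: the syndrome of the received word is the binary representation of the position of the corrupted digit, read off the columns of the check matrix.

```python
import numpy as np

H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])       # H_lex from Example 2.3.7
G = np.array([[0, 0, 1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 0],
              [1, 1, 1, 0, 0, 0, 0]])       # generating matrix with G H^T = 0

assert not (G @ H.T % 2).any()              # codewords lie in ker H

u = np.array([1, 0, 1, 1])                  # a 4-bit message (illustrative)
x = u @ G % 2                               # encode
y = x.copy(); y[4] ^= 1                     # flip digit 5 in transit

s = y @ H.T % 2                             # syndrome
pos = int(s[0] + 2 * s[1] + 4 * s[2])       # column pos of H equals binary rep of pos
y[pos - 1] ^= 1                             # correct the single error
assert (y == x).all()
print("corrupted position:", pos)
```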
The construction of the Hamming [7, 4] code admits a straightforward generalisation to any length N = 2^l − 1; namely, consider an l × (2^l − 1) matrix H^Ham whose columns represent all possible non-zero binary vectors of length l:
    H^Ham = ( the 2^l − 1 columns run through all non-zero binary l-vectors ).        (2.3.6)
The rows of H^Ham are linearly independent, and hence H^Ham may be considered as a check matrix of a linear code of length N = 2^ℓ − 1 and rank N − ℓ = 2^ℓ − 1 − ℓ. Any two columns of H^Ham are linearly independent but there exist linearly dependent triples of columns, e.g. x, y and x + y. Hence, the code X^Ham with the check matrix H^Ham has minimal distance 3, i.e. it detects 2 errors and corrects 1. This code is called the Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code. It is a perfect one-error correcting code: the volume of the 1-ball v_{2^ℓ−1}(1) equals 1 + 2^ℓ − 1 = 2^ℓ, and size × volume = 2^{2^ℓ−1−ℓ} · 2^ℓ = 2^{2^ℓ−1} = 2^N. The information rate is (2^ℓ − 1 − ℓ)/(2^ℓ − 1) → 1 as ℓ → ∞. This proves

Theorem 2.3.13 The above construction defines a family of [2^ℓ − 1, 2^ℓ − 1 − ℓ, 3] linear binary codes X^Ham_{2^ℓ−1}, ℓ = 1, 2, . . ., which are perfect one-error correcting codes.
(2.3.7)
Worked Example 2.3.17 Let y be a received word and let w be the leader of the coset y + X (i.e. a word of minimal weight in this coset). Show that the word x = y + w is always a codeword that minimises the distance between y and the words from X.

Solution As y and w are in the same coset, y + w ∈ X (see Example 2.3.16(3)). All other words from X are obtained as the sums y + v where v runs over the coset y + X. Hence, for any x ∈ X,
and contains at most one non-zero digit. If the syndrome yH^T = s occupies position i among the columns of the parity-check matrix then, for the word e_i = 0 . . . 1 0 . . . 0 with the only non-zero digit at position i,

    (y + e_i)H^T = s + s = 0.

That is, (y + e_i) ∈ X and e_i ∈ y + X. Obviously, e_i is the leader.
The duals (X^Ham)⊥ of binary Hamming codes form a particular class, called simplex codes. If X^Ham is [2^ℓ − 1, 2^ℓ − 1 − ℓ], its dual (X^Ham)⊥ is [2^ℓ − 1, ℓ]; every non-zero codeword of (X^Ham)⊥ has weight 2^{ℓ−1}, and the distance between any two distinct codewords equals 2^{ℓ−1}. Justify the term simplex.

Solution If X = X^Ham is the binary Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code then the dual X⊥ is [2^ℓ − 1, ℓ], and its ℓ × (2^ℓ − 1) generating matrix is H^Ham. The weight of any row of H^Ham equals 2^{ℓ−1} (and so d(X⊥) = 2^{ℓ−1}). Indeed, the weight of row j of H^Ham equals the number of non-zero vectors of length ℓ with 1 at position j. This gives 2^{ℓ−1} as the weight, as half of all 2^ℓ vectors from H_ℓ have 1 at any given position.
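A quick check of the simplex property for ℓ = 3 (our own Python illustration, not from the book): the dual of the Hamming [7, 4] code is generated by H^Ham_lex above, and every non-zero codeword should have weight 2^{ℓ−1} = 4.

    from itertools import product

    H_lex = [[1,0,1,0,1,0,1],
             [0,1,1,0,0,1,1],
             [0,0,0,1,1,1,1]]

    def span(rows):
        """All F_2 linear combinations of the given rows."""
        for coeffs in product([0, 1], repeat=len(rows)):
            yield tuple(sum(c * r[j] for c, r in zip(coeffs, rows)) % 2
                        for j in range(len(rows[0])))

    weights = {sum(c) for c in span(H_lex) if any(c)}
    print("weights of non-zero dual codewords:", weights)   # expected: {4}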
    ∑_{1≤l≤N} x_l h_{jl} = 0.        (2.3.8)
Lemma 2.3.22 For any [N, k] code, there exists an equivalent code whose generating matrix G has a canonical form: G = ( G′ | I_k ) where I_k is the identity k × k matrix and G′ is a k × (N − k) matrix. Similarly, the parity-check matrix H may have a standard form, which is ( I_{N−k} | H′ ).
We now discuss the decoding procedure for a general linear code X of rank k. As was noted before, it may be used for encoding source messages (strings) u = u_1 . . . u_k of length k. The source encoding u ∈ F_q^k ↦ X becomes particularly simple when the generating and parity-check matrices are used in the canonical (or standard) form.

Theorem 2.3.23 For any linear code X there exists an equivalent code X′ with the generating matrix G^can and the check matrix H^can in standard form (2.3.4a), (2.3.4b) and G′ = −(H′)^T.
Proof Assume that the code X is non-trivial (i.e. not reduced to the zero word 0). Write a basis for X and take the corresponding generating matrix G. By performing row operations (where a pair of rows i and j are exchanged, or row i is replaced by row i plus row j) we can change the basis, but we do not change the code. Our matrix G contains a non-zero column, say l_1: perform row operations to make g_{1l_1} the only non-zero entry in this column. By permuting digits (columns), place column l_1 at the first position of the identity block, i.e. at position N − k + 1. Drop row 1 and this column and perform a similar procedure with the rest, ending up with the only non-zero entry g_{2l_2} in a column l_2. Place column l_2 at position N − k + 2. Continue until an upper triangular k × k submatrix emerges. Further operations may be reduced to this submatrix only. If this submatrix is a unit matrix, stop. If not, pick the first column with more than one non-zero entry. Add the corresponding rows from the bottom to kill the redundant non-zero entries. Repeat until a unit submatrix emerges. Now the generating matrix is in a standard form, and the new code is equivalent to the original one.
To complete the proof, observe that the matrices G^can and H^can figuring in (2.3.4a), (2.3.4b), with G′ = −(H′)^T, have k and N − k linearly independent rows, respectively. Besides, the k × (N − k) matrix G^can(H^can)^T vanishes. In fact,

    (G^can(H^can)^T)_{ij} = ⟨row i of G^can · column j of (H^can)^T⟩ = g′_{ij} − g′_{ij} = 0.

Hence, H^can is a check matrix for G^can.
Returning to source encoding, select the generating matrix in the canonical form G^can. Then, given a string u = u_1 . . . u_k, we set x = ∑_{i=1}^{k} u_i g^can(i), where g^can(i) represents row i of G^can. The last k digits in x give the string u; they are called the information digits. The first N − k digits are used to ensure that x ∈ X; they are called the parity-check digits.
The standard form is convenient because, in the above representation X = F_q^k G^can, the initial (N − k)-string of each codeword is used for error protection (enabling the detection and correction of errors), and the final k-string yields the message from F_q^k. As in the binary case, the parity-check matrix H satisfies Theorem 2.3.11. In particular, the minimal distance of a code equals the minimal number of linearly dependent columns in its parity-check matrix H.
Definition 2.3.24 Given an [N, k] linear q-ary code X with parity-check matrix H, the syndrome of an N-vector y ∈ F_q^N is the (N − k)-vector yH^T ∈ F_q^{N−k}, and the syndrome subspace is the image F_q^N H^T. A coset of X by a vector w ∈ F_q^N is denoted by w + X and formed by the words of the form w + x where x ∈ X. All cosets have the same number of elements, equal to q^k, and partition the whole Hamming space F_q^N into q^{N−k} disjoint subsets; the code X is one of them. The cosets are in one-to-one correspondence with the syndromes yH^T. The syndrome decoding procedure is carried out as in the binary case: a received vector y is decoded by x = y − w where w is the leader of the coset y + X (i.e. the word from y + X with minimum weight).
All drawbacks we had in the case of binary syndrome decoding persist in the general q-ary case, too (and in fact are more pronounced): the coset tables are bulky, and the leader of a coset may be not unique. However, for q-ary Hamming codes the syndrome decoding procedure works well, as we will see in Section 2.4.
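The following Python sketch (ours) tabulates coset leaders by syndrome for the binary Hamming [7, 4] code with the canonical check matrix of (2.3.4a)–(2.3.4b) and decodes a received word by subtracting the leader of its coset.

    from itertools import product

    H = [[1,0,0,1,1,0,1],
         [0,1,0,1,0,1,1],
         [0,0,1,0,1,1,1]]
    N = 7

    def syndrome(y):
        return tuple(sum(h[j] * y[j] for j in range(N)) % 2 for h in H)

    # Build the syndrome -> coset leader table by scanning words of low weight.
    leader = {}
    for y in sorted(product([0, 1], repeat=N), key=sum):
        s = syndrome(y)
        if s not in leader:            # first (lowest-weight) word wins
            leader[s] = y

    def decode(y):
        w = leader[syndrome(y)]
        return tuple((yj - wj) % 2 for yj, wj in zip(y, w))

    received = (1, 1, 1, 1, 0, 0, 0)   # codeword 1101000 with an error in digit 3
    print(decode(received))            # expected: (1, 1, 0, 1, 0, 0, 0)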
In the case of linear codes, some of the bounds can be improved (or rather new
bounds can be produced).
Worked Example 2.3.25
(a) Fix a codeword x ∈ X with exactly d non-zero digits. Prove that truncating X on the non-zero digits of x produces a code X′_{N−d} of length N − d, rank k − 1 and distance d′ for some d′ ≥ ⌈d/2⌉.
(b) Deduce the Griesmer bound improving the Singleton bound (2.1.12):

    N ≥ d + ∑_{1≤l≤k−1} ⌈d/2^l⌉.        (2.3.9)

Solution (a) Without loss of generality, assume that the non-zero digits in x are x_1 = ··· = x_d = 1. Truncating on digits 1, . . . , d will produce the code X′_{N−d} with the rank reduced by 1. Indeed, suppose that a linear combination of k − 1 basis vectors vanishes on positions d + 1, . . . , N. Then on the positions 1, . . . , d all the values equal either 0s or 1s, because d is the minimal distance. But the first case is impossible, unless the vectors are linearly dependent. The second case also leads to a contradiction, by adding the string x and obtaining k linearly dependent vectors in the code X. Next, suppose that X′_{N−d} has distance d′ < ⌈d/2⌉ and take y ∈ X with

    w(y′) = ∑_{j=d+1}^{N} y_j = d′.
    N ≥ d + ∑_{1≤l≤k−1} ⌈d/2^l⌉.        (2.3.10)
Proof Let X be a linear code with distance at least d, of maximal rank and maximal size. If inequality (2.3.10) is violated, the union of all Hamming spheres of radius d − 1 centred on the codewords cannot cover the whole Hamming space. So, there must be a point y that is not in any Hamming sphere around a codeword. Then for any codeword x and any scalar b ∈ F_q the vectors y and y + b·x are in the same coset by X. Also y + b·x cannot be in any Hamming sphere of radius d − 1. The same is true for x + b·y because, if it were, then y would be in a Hamming sphere around another codeword. Here we use the fact that F_q is a field. Then the vector subspace spanned by X and y is a linear code larger than X and with minimal distance at least d. That is a contradiction, which completes the proof.

For example, let q = 2 and N = 10. Then 2^5 < v_{10,2}(2) = 56 < 2^6. Upon taking d = 3, the Gilbert bound guarantees the existence of a binary [10, 5] code with d ≥ 3.
For brevity, we will now write X^H and H^H (or even simply H when possible) instead of X^Ham_{N,q} and H^Ham. In the binary case (with q = 2), the matrix H^H is composed of all non-zero binary column-vectors of length ℓ. For general q we have to exclude columns that are multiples of each other. To this end, we can choose as columns all non-zero ℓ-words that have 1 in their top-most non-zero component. No two such columns are proportional, and their total equals (q^ℓ − 1)/(q − 1). Next, as in the binary case, one can arrange words with digits from F_q in the lexicographic order. By construction, any two columns of H^H are linearly independent, but there exist triples of linearly dependent columns. Hence, d(X^H) = 3, and X^H detects two errors and corrects one. Furthermore, X^H is a perfect code correcting a single error, as

    M(1 + (q − 1)N) = q^k (1 + (q − 1)(q^ℓ − 1)/(q − 1)) = q^{k+ℓ} = q^N.
As in the binary case, the general Hamming codes admit an efficient (and elegant) decoding procedure. Suppose a parity-check matrix H = H^H has been constructed as above. Upon receiving a word y ∈ F_q^N we calculate the syndrome yH^T ∈ F_q^ℓ. If yH^T = 0 then y is a codeword. Otherwise, the column-vector Hy^T is a scalar multiple of a column h(j) of H: Hy^T = a·h(j), for some j = 1, . . . , N and a ∈ F_q \ {0}. In other words, yH^T = a·e(j)H^T where the word e(j) = 0 . . . 1 . . . 0 ∈ H_{N,q} (with the jth digit 1, all others 0). Then we decode y by x = y − a·e(j), i.e. simply change digit y_j in y to y_j − a.
Summarising, we have the following

Theorem 2.4.2 The q-ary Hamming codes form a family of

    [ (q^ℓ − 1)/(q − 1), (q^ℓ − 1)/(q − 1) − ℓ, 3 ]

perfect codes X^H_N, for N = (q^ℓ − 1)/(q − 1), ℓ = 1, 2, . . ., correcting one error, with a decoding rule that changes the digit y_j to y_j − a in a received word y = y_1 . . . y_N ∈ F_q^N, where 1 ≤ j ≤ N and a ∈ F_q \ {0} are determined from the condition that Hy^T = a·h(j), the a-multiple of column j of the parity-check matrix H.
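A minimal Python sketch (ours) of this decoding rule for the smallest non-binary case q = 3, ℓ = 2, i.e. the ternary Hamming [4, 2, 3] code; the columns of H are all non-zero vectors of F_3^2 whose top-most non-zero entry is 1.

    q = 3
    H = [[1, 1, 1, 0],
         [0, 1, 2, 1]]
    N = 4

    def syndrome(y):
        return tuple(sum(h[j] * y[j] for j in range(N)) % q for h in H)

    def decode(y):
        s = syndrome(y)
        if s == (0, 0):
            return tuple(y)                       # already a codeword
        # Find j and a != 0 with s = a * (column j of H); subtract a at digit j.
        for j in range(N):
            col = (H[0][j], H[1][j])
            for a in range(1, q):
                if s == tuple(a * c % q for c in col):
                    x = list(y)
                    x[j] = (x[j] - a) % q
                    return tuple(x)

    codeword = (1, 2, 0, 1)        # satisfies H . c^T = 0 mod 3
    corrupted = (1, 0, 0, 1)       # digit 2 changed from 2 to 0
    print(decode(corrupted))       # expected: (1, 2, 0, 1)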
Hamming codes were discovered by R. Hamming and M. Golay in the late
1940s. At that time Hamming, an electrical engineer turned computer scientist
during the Jurassic computers era, was working at Los Alamos (as an intellectual janitor to local nuclear physicists, in his own words). This discovery shaped
the theory of codes for more than two decades: people worked hard to extend properties of Hamming codes to wider classes of codes (with variable success). Most
of the topics on codes discussed in this book are related, in one way or another, to
Hamming codes. Richard Hamming was not only an outstanding scientist but also
an illustrious personality; his writings (and accounts of his life) are entertaining
and thought-provoking.
Until the late 1950s, the Hamming codes were a unique family of codes existing in dimensions N , with regular properties. It was then discovered that
these codes have a deep algebraic background. The development of the algebraic
methods based on these observations is still a dominant theme in modern coding
theory.
Another important example is the four Golay codes (two binary and two ternary).
Marcel Golay (1902–1989) was a Swiss electrical engineer who lived and worked
in the USA for a long time. He had an extraordinary ability to see the discrete
geometry of the Hamming spaces and guess the construction of various codes
without bothering about proofs.
The binary Golay code X^Gol_24 is a [24, 12] code with the generating matrix G = (I_12 | G′), where

    G′ = ( 0 1 1 1 1 1 1 1 1 1 1 1
           1 1 1 0 1 1 1 0 0 0 1 0
           1 1 0 1 1 1 0 0 0 1 0 1
           1 0 1 1 1 0 0 0 1 0 1 1
           1 1 1 1 0 0 0 1 0 1 1 0
           1 1 1 0 0 0 1 0 1 1 0 1
           1 1 0 0 0 1 0 1 1 0 1 1
           1 0 0 0 1 0 1 1 0 1 1 1
           1 0 0 1 0 1 1 0 1 1 1 0
           1 0 1 0 1 1 0 1 1 1 0 0
           1 1 0 1 1 0 1 1 1 0 0 0
           1 0 1 1 0 1 1 1 0 0 0 1 ).        (2.4.1)
The rule of forming the matrix G′ is ad hoc (and this is how it was determined by M. Golay in 1949). There will be further ad hoc arguments in the analysis of Golay codes.

Remark 2.4.3 Interestingly, there is a systematic way of constructing all codewords of X^Gol_24 (or its equivalent) by fitting together two versions of the Hamming [7, 4] code X^H_7. First, observe that reversing the order of all the digits of a Hamming code X^H_7 yields an equivalent code which we denote by X^K_7. Then add a parity-check to both X^H_7 and X^K_7, producing codes X^{H,+}_8 and X^{K,+}_8. Finally, select two words a, b ∈ X^{H,+}_8 (not necessarily distinct) and a word x ∈ X^{K,+}_8. Then all 2^12 codewords of X^Gol_24 of length 24 can be written as concatenations (a + x)(b + x)(a + b + x). This can be checked by inspection of the generating matrices.
Lemma 2.4.4 The binary Golay code X^Gol_24 is self-dual, with (X^Gol_24)⊥ = X^Gol_24. The code X^Gol_24 is also generated by the matrix G⊥ = (G′ | I_12).

Proof A direct calculation shows that any two rows of the matrix G are dot-orthogonal. Thus X^Gol_24 ⊆ (X^Gol_24)⊥. But the dimensions of X^Gol_24 and (X^Gol_24)⊥ coincide. Hence, X^Gol_24 = (X^Gol_24)⊥. The last assertion of the lemma now follows from the property (G′)^T = G′.
Worked Example 2.4.5 Prove that the minimal distance of X^Gol_24 equals 8.

Solution First, we check that for all x ∈ X^Gol_24 the weight w(x) is divisible by 4. This is true for every row of G = (I_12 | G′): the number of 1s is either 12 or 8. Next, for all binary N-words x, x′,

    w(x + x′) = w(x) + w(x′) − 2w(x ∧ x′),

where x ∧ y is the wedge-product, with digits (x ∧ y)_j = min(x_j, y_j), 1 ≤ j ≤ N (cf. (2.1.6b)). But for any pair g(j), g(j′) of the rows of G, w(g(j) ∧ g(j′)) = 0 mod 2. So, 4 divides w(x) for all x ∈ X^Gol_24.
On the other hand, X^Gol_24 does not have codewords of weight 4. To prove this, compare two generating matrices, (I_12 | G′) and ((G′)^T | I_12). If x ∈ X^Gol_24 has w(x) = 4, write x as a concatenation x_L x_R. Any non-trivial sum of rows of (I_12 | G′) has weight at least 1 in the L-half, so w(x_L) ≥ 1. Similarly, w(x_R) ≥ 1. But if w(x_L) = 1 then x must be one of the rows of (I_12 | G′), none of which has weight w(x_R) = 3. Hence, w(x_L) ≥ 2. Similarly, w(x_R) ≥ 2. But then the only possibility is that w(x_L) = w(x_R) = 2, which is ruled out by a direct check. Thus, w(x) ≥ 8. But (I_12 | G′) has rows of weight 8. So, d(X^Gol_24) = 8.
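A brute-force confirmation (our Python sketch, not from the book): the code generated by G = (I_12 | G′) from (2.4.1) has 2^12 = 4096 codewords, so its weight distribution is easy to enumerate directly.

    from itertools import product

    Gp = ["011111111111", "111011100010", "110111000101", "101110001011",
          "111100010110", "111000101101", "110001011011", "100010110111",
          "100101101110", "101011011100", "110110111000", "101101110001"]
    G = [[int(i == j) for j in range(12)] + [int(c) for c in row]
         for i, row in enumerate(Gp)]

    weights = set()
    for u in product([0, 1], repeat=12):
        c = [sum(ui * gi[j] for ui, gi in zip(u, G)) % 2 for j in range(24)]
        if any(c):
            weights.add(sum(c))

    print(sorted(weights))                          # expected: [8, 12, 16, 24]
    print(min(weights), all(w % 4 == 0 for w in weights))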
When we truncate X^Gol_24 at any digit, we get X^Gol_23, a [23, 12, 7] code. This code is a perfect 3-error correcting code. We recover X^Gol_24 from X^Gol_23 by adding a parity-check.
The Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ, 3] codes and the Golay [23, 12, 7] code are the only possible perfect binary linear codes.
The ternary Golay code X^Gol_{12,3} of length 12 has the generating matrix (I_6 | G′(3)), where

    G′(3) = ( 0 1 1 1 1 1
              1 0 1 2 2 1
              1 1 0 1 2 2
              1 2 1 0 1 2
              1 2 2 1 0 1
              1 1 2 2 1 0 ),   with (G′(3))^T = G′(3).        (2.4.2)

The ternary Golay code X^Gol_{11,3} is a truncation of X^Gol_{12,3} at the last digit.
Theorem 2.4.6
    3^6 (1 + 11·2 + (11·10/2)·2^2) = 3^6 · 3^5 = 3^{11}.        (2.4.3)
So,

    M = ( 0 1 0 1 . . . 1      v^(1)
          0 0 1 1 . . . 1      v^(2)
          ⋮ ⋮ ⋮ ⋮        ⋮      ⋮
          0 0 0 0 . . . 1 )    v^(m)        (2.4.4)

where column j, j = 0, 1, 2, . . . , 2^m − 1, carries the binary digits j_1, . . . , j_m of j = ∑_{1≤i≤m} j_i 2^{i−1}. The columns of M list all vectors from H_{m,2} and the rows are vectors from H_{N,2}, denoted by v^(1), . . . , v^(m). In particular, v^(m) has the first 2^{m−1} entries 0 and the last 2^{m−1} entries 1. To pass from M_m to M_{m−1}, one drops the last row and takes one of the two identical halves of the remaining (m − 1) × N matrix. Conversely, to pass from M_{m−1} to M_m, one concatenates two copies of M_{m−1} and adds the row v^(m):

    M_m = ( M_{m−1}   M_{m−1}
            0 . . . 0   1 . . . 1 ).        (2.4.5)
Each position j ∈ {0, 1, . . . , 2^m − 1} has the binary expansion

    j = ∑_{1≤i≤m} j_i 2^{i−1},        (2.4.6)

and v^(i) is the indicator function of the set A_i = {j : j_i = 1}. In terms of the wedge-product (cf. (2.1.6b)), v^(i_1) ∧ v^(i_2) ∧ ··· ∧ v^(i_k) is the indicator function of the intersection A_{i_1} ∩ ··· ∩ A_{i_k}. If all i_1, . . . , i_k are distinct, the cardinality ♯(A_{i_1} ∩ ··· ∩ A_{i_k}) = 2^{m−k}. In other words, we have the following.
Lemma 2.4.7
An important fact is
Theorem 2.4.8 The vectors v^(0) = 11 . . . 1 and ∧_{1≤j≤k} v^(i_j), 1 ≤ i_1 < ··· < i_k ≤ m, k = 1, . . . , m, form a basis in H_{N,2}.

Proof It suffices to check that the standard basis N-words e^(j) = 0 . . . 1 . . . 0 (1 in position j, 0 elsewhere) can be written as linear combinations of the above vectors. But

    e^(j) = y^(1) ∧ ··· ∧ y^(m),  where y^(i) = v^(i) if j_i = 1 and y^(i) = v^(0) + v^(i) if j_i = 0,   0 ≤ j ≤ N − 1.        (2.4.7)

[All factors in position j are equal to 1 and at least one factor in any position l ≠ j is equal to 0.]
Example 2.4.9
For m = 4, N = 16,

    v^(0) = 1111111111111111
    v^(1) = 0101010101010101
    v^(2) = 0011001100110011
    v^(3) = 0000111100001111
    v^(4) = 0000000011111111
    v^(1) ∧ v^(2) = 0001000100010001
    v^(1) ∧ v^(3) = 0000010100000101
    v^(1) ∧ v^(4) = 0000000001010101
    v^(2) ∧ v^(3) = 0000001100000011
    v^(2) ∧ v^(4) = 0000000000110011
    v^(3) ∧ v^(4) = 0000000000001111
    v^(1) ∧ v^(2) ∧ v^(3) = 0000000100000001
    v^(1) ∧ v^(2) ∧ v^(4) = 0000000000010001
    v^(1) ∧ v^(3) ∧ v^(4) = 0000000000000101
    v^(2) ∧ v^(3) ∧ v^(4) = 0000000000000011
    v^(1) ∧ v^(2) ∧ v^(3) ∧ v^(4) = 0000000000000001
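The table above is easy to regenerate programmatically. Our Python sketch builds the rows v^(i) from the binary digits of the column index and takes coordinatewise minima (the wedge-product) of all subsets of rows:

    from itertools import combinations

    m, N = 4, 16
    v = {i: [(j >> (i - 1)) & 1 for j in range(N)] for i in range(1, m + 1)}
    v[0] = [1] * N                                  # v^(0) = all-ones word

    def wedge(*rows):
        return [min(bits) for bits in zip(*rows)]

    for k in range(1, m + 1):
        for idx in combinations(range(1, m + 1), k):
            label = " ^ ".join(f"v({i})" for i in idx)
            print(label, "=", "".join(map(str, wedge(*(v[i] for i in idx)))))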
(2.4.8)
(2.4.9)
(2.4.10)
This means that the Reed–Muller codes are related via the bar-product construction (cf. Example 2.1.8(viii)):

    X^RM(r, m) = X^RM(r, m − 1) | X^RM(r − 1, m − 1).        (2.4.11)

Therefore, inductively,

    d(X^RM(r, m)) = 2^{m−r}.        (2.4.12)

In fact, for r = 0, d(X^RM(0, m)) = 2^m, and for all m, d(X^RM(m, m)) = 1 = 2^0. Assume d(X^RM(r − 1, m)) = 2^{m−r+1} for all m ≥ r − 1, and d(X^RM(r − 1, m − 1)) = 2^{m−r}. Then (cf. (2.4.14) below)

    d(X^RM(r, m)) = min[2d(X^RM(r, m − 1)), d(X^RM(r − 1, m − 1))]
                  = min[2 · 2^{m−1−r}, 2^{m−1−r+1}] = 2^{m−r}.        (2.4.13)
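A small numerical check (ours, assuming the standard description of X^RM(r, m) as the span of v^(0) and the wedge-products of at most r of the rows v^(i)): build a generator matrix of RM(r, m) and verify d = 2^{m−r} by enumeration for small parameters.

    from itertools import combinations, product

    def rm_generators(r, m):
        N = 2 ** m
        v = {i: [(j >> (i - 1)) & 1 for j in range(N)] for i in range(1, m + 1)}
        rows = [[1] * N]                                   # v^(0)
        for k in range(1, r + 1):
            for idx in combinations(range(1, m + 1), k):
                rows.append([min(v[i][j] for i in idx) for j in range(N)])
        return rows

    def min_distance(rows):
        N = len(rows[0])
        best = N
        for coeffs in product([0, 1], repeat=len(rows)):
            if any(coeffs):
                w = sum(sum(c * row[j] for c, row in zip(coeffs, rows)) % 2
                        for j in range(N))
                best = min(best, w)
        return best

    for m, r in [(2, 0), (2, 1), (2, 2), (3, 1), (3, 2), (4, 1), (4, 2)]:
        assert min_distance(rm_generators(r, m)) == 2 ** (m - r)
    print("d(RM(r, m)) = 2^(m-r) verified on the listed cases")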
Summarising,

Theorem 2.4.11

    d(X_1 | X_2) = min[2d(X_1), d(X_2)].        (2.4.14)

Proof For any x ∈ X_1 and y ∈ X_2, not both zero, w(x | x + y) ≥ min[2w(x), w(y)]: if y = 0 the weight equals 2w(x), while if y ≠ 0 then w(x) + w(x + y) ≥ w(y). Hence

    d(X_1 | X_2) ≥ min[2d(X_1), d(X_2)].        (2.4.15)

On the other hand, if x ∈ X_1 has w(x) = d(X_1) then d(X_1 | X_2) ≤ w(x | x) = 2d(X_1). Finally, if y ∈ X_2 has w(y) = d(X_2) then d(X_1 | X_2) ≤ w(0 | y) = d(X_2). We conclude that

    d(X_1 | X_2) ≤ min[2d(X_1), d(X_2)],        (2.4.16)

proving (2.4.14).
Next, we check that

    X_2⊥ | X_1⊥ ⊆ (X_1 | X_2)⊥.

Indeed, let (u | u + v) ∈ X_2⊥ | X_1⊥ and (x | x + y) ∈ X_1 | X_2. The dot-product

    ⟨(u | u + v) · (x | x + y)⟩ = ⟨u · x⟩ + ⟨(u + v) · (x + y)⟩
                               = ⟨u · y⟩ + ⟨v · (x + y)⟩ = 0,

since u ∈ X_2⊥, y ∈ X_2, v ∈ X_1⊥ and (x + y) ∈ X_1. In addition, we know that

    rank(X_2⊥ | X_1⊥) = N − rank(X_2) + N − rank(X_1)
                      = 2N − rank(X_1 | X_2) = rank((X_1 | X_2)⊥).

This implies that in fact

    X_2⊥ | X_1⊥ = (X_1 | X_2)⊥.        (2.4.17)

Next, we verify that

    X^RM(r, m)⊥ = X^RM(m − r − 1, m).

It is done by induction in m ≥ 3. By the above, and by the induction hypothesis,

    X^RM(r, m)⊥ = (X^RM(r, m − 1) | X^RM(r − 1, m − 1))⊥
                = X^RM(r − 1, m − 1)⊥ | X^RM(r, m − 1)⊥
                = X^RM(m − r − 1, m − 1) | X^RM(m − r − 2, m − 1)
                = X^RM(m − r − 1, m).
Definition 2.4.13 For 1 ≤ i_1 < ··· < i_k ≤ m, let C(i_1, . . . , i_k) denote the set of positions

    j = ∑_{1≤i≤m} j_i 2^{i−1}   with j_i = 0 for i ∉ {i_1, . . . , i_k}.        (2.4.18)

(2.4.19)

    ∑_{j∈C(i_1,...,i_k)} y_j v^(i_1) ∧ ··· ∧ v^(i_k)        (2.4.20)

(2.4.21)
We see that the information space H_{k,2} is embedded into H_{N,2} by identifying the entries a_j ↔ a_{i_1,...,i_l}, where j = j_0 2^0 + j_1 2^1 + ··· + j_{m−1} 2^{m−1} and i_1, . . . , i_l are the successive positions of the 1s among the binary digits of j, 1 ≤ l ≤ r. With such an identification we obtain:

Lemma 2.4.14

    ∑_{j∈C(i_1,...,i_l)} x_j = a_{i_1,...,i_l},  if l ≤ r,
                             = 0,                if l > r.        (2.4.22)
Proof
Lemma 2.4.15 For any t ∉ {i_1, . . . , i_r},

    a_{i_1,...,i_r} = ∑_{j∈C(i_1,...,i_r)+2^{t−1}} x_j.        (2.4.23)
Proof The proof follows from the fact that C(i_1, . . . , i_r, t) is the disjoint union C(i_1, . . . , i_r) ∪ (C(i_1, . . . , i_r) + 2^{t−1}) and the equation ∑_{j∈C(i_1,...,i_r,t)} x_j = 0 (cf. (2.4.19)).

Moreover:

Theorem 2.4.16 For any information symbol a_{i_1,...,i_r} corresponding to v^(i_1) ∧ ··· ∧ v^(i_r), we can split the set {0, . . . , N − 1} into 2^{m−r} disjoint subsets S, each containing 2^r elements, such that, for all such S, a_{i_1,...,i_r} = ∑_{j∈S} x_j.

Proof The list of sets S begins with C(i_1, . . . , i_r) and continues with the (m − r) disjoint sets C(i_1, . . . , i_r) + 2^{t−1} where 1 ≤ t ≤ m, t ∉ {i_1, . . . , i_r}. Next, we take any pair 1 ≤ t_1 < t_2 ≤ m such that {t_1, t_2} ∩ {i_1, . . . , i_r} = ∅. Then C(i_1, . . . , i_r, t_1, t_2) contains the disjoint sets C(i_1, . . . , i_r), C(i_1, . . . , i_r) + 2^{t_1−1} and C(i_1, . . . , i_r) + 2^{t_2−1}, and for each of them, a_{i_1,...,i_r} = ∑ x_j. Then the same is true for the remaining set

    C(i_1, . . . , i_r) + 2^{t_1−1} + 2^{t_2−1}
      = C(i_1, . . . , i_r, t_1, t_2) \ [ C(i_1, . . . , i_r) ∪ (C(i_1, . . . , i_r) + 2^{t_1−1}) ∪ (C(i_1, . . . , i_r) + 2^{t_2−1}) ].        (2.4.24)

(2.4.25)
Here each such set is labelled by a collection {t_1, . . . , t_s} where 0 ≤ s ≤ m − r, t_1 < ··· < t_s and {t_1, . . . , t_s} ∩ {i_1, . . . , i_r} = ∅. [The union over {t′_1, . . . , t′_{s′}} ⊂ {t_1, . . . , t_s} in (2.4.25) is over all (strict) subsets of {t_1, . . . , t_s}, with t′_1 < ··· < t′_{s′} and s′ = 0, . . . , s − 1 (s′ = 0 gives the empty subset).] The total number of sets obtained in this way equals 2^{m−r}, and each of them has 2^r elements by construction.
Theorem 2.4.16 provides a rationale for the so-called majority decoding for the Reed–Muller codes. Namely, upon receiving a word y = (y_0, . . . , y_{N−1}), produced from a codeword x ∈ X^RM_{r,m}, we take any 1 ≤ i_1 < ··· < i_r ≤ m and consider the sums ∑_{j∈S} y_j along the 2^{m−r} above sets S. If y ∈ X^RM_{r,m}, all these sums coincide and give the information symbol a_{i_1,...,i_r}. If y is not a codeword but the number of erroneous digits is < 2^{m−r−1}, then fewer than half of the sums are affected, and the majority of them still equals a_{i_1,...,i_r}; we take this majority value as the decoded symbol. Having recovered all symbols a_{i_1,...,i_r}, we form x^(1) = ∑ a_{i_1,...,i_r} v^(i_1) ∧ ··· ∧ v^(i_r); the word y + x^(1) will have δ(x + x^(1), y + x^(1)) = δ(x, y) errors, which is < 2^{m−r−1} ≤ d(X^RM_{r−1,m})/2. We can repeat the above procedure and obtain the correct a_{i_1,...,i_{r−1}} for any 1 ≤ i_1 < ··· < i_{r−1} ≤ m, etc. At the end, we recover the whole sequence of information symbols a_{i_1,...,i_r}.
Therefore, any word y ∈ H_{N,2} with distance δ(y, X^RM_{r,m}) < d(X^RM_{r,m})/2 is uniquely decoded.
Worked Example 2.4.18 The MDS codes [N, N, 1], [N, 1, N] and [N, N − 1, 2] always exist and are called trivial. Any [N, k] MDS code with 2 ≤ k ≤ N − 2 is called non-trivial. Show that there is no non-trivial MDS code over F_q with q ≤ k ≤ N − q. In particular, there is no non-trivial binary MDS code (which causes a discernible lack of enthusiasm about binary MDS codes).

Solution Indeed, the [N, N, 1], [N, N − 1, 2] and [N, 1, N] codes are MDS. Take q ≤ k ≤ N − q and assume X is a q-ary [N, k] MDS code. Take its generating matrix G in the standard form (I_k | G′) where G′ is k × (N − k), N − k ≥ q.
If some entry of G′ were zero then the corresponding column of G would be a linear combination of k − 1 columns of I_k. This is impossible by (b) in the previous example; hence G′ has no 0 entry. Next, assume that the first row of G′ is 1 . . . 1: otherwise we can perform scalar multiplications of columns, maintaining equivalence of codes. Now take the second row of G′: it is of length N − k ≥ q and has no 0 entry. Then it must contain repeated entries (as there are only q − 1 non-zero elements in F_q). That is,
together with some other sharp observations made at the end of the 1950s, particularly the invention of BCH codes, opened a connection from the theory of linear
codes (which was then at its initial stage) to algebra, particularly to the theory of finite fields. This created algebraic coding theory, a thriving direction in the modern
theory of linear codes.
We begin with binary cyclic codes. The coding and decoding procedures for
binary cyclic codes of length N are based on the related algebra of polynomials
with binary coefficients:
    a(X) = a_0 + a_1X + ··· + a_{N−1}X^{N−1},  where a_k ∈ F_2 for k = 0, . . . , N − 1.        (2.5.1)
Such polynomials can be added and multiplied in the usual fashion, except that
X k + X k = 0. This defines a binary polynomial algebra F2 [X]; the operations over
binary polynomials refer to this algebra. The degree deg a(X) of polynomial a(X)
equals the maximal label of its non-zero coefficient. The degree of the zero polynomial is set to be 0. Thus, the representation (2.5.1) covers polynomials of degree
< N.
Theorem 2.5.1
(b) (The division algorithm) Let f(X) and h(X) be two binary polynomials with h(X) ≢ 0. Then there exist unique polynomials g(X) and r(X) such that

    f(X) = g(X)h(X) + r(X)   with deg r(X) < deg h(X).        (2.5.2)
The polynomial g(X) is called the ratio, or quotient, and r(X) the remainder.
Proof
(a) The statement follows from the binomial decomposition where all
intermediate terms vanish.
(b) If deg h(X) > deg f (X) we simply set
f (X) = 0 h(X) + f (X).
If deg h(X) deg f (X), we can perform the standard procedure of long division, with the rules of the binary addition and multiplication.
Example 2.5.2
(a) (1 + X + X^3 + X^4)(X + X^2 + X^3) = X + X^7.
(b) 1 + X^N = (1 + X)(1 + X + ··· + X^{N−1}).
(c) The quotient of X + X^2 + X^6 + X^7 + X^8 divided by 1 + X + X^2 + X^4 equals X^3 + X^4; the remainder equals X + X^2 + X^3.
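Binary polynomial division is easy to mechanise. Our Python sketch encodes a polynomial as an integer (bit i = coefficient of X^i) and reproduces part (c):

    def poly_mul(a, b):
        result = 0
        while b:
            if b & 1:
                result ^= a
            a <<= 1
            b >>= 1
        return result

    def poly_divmod(f, h):
        """Long division of f by h over F_2; returns (quotient, remainder)."""
        q = 0
        while f and f.bit_length() >= h.bit_length():
            shift = f.bit_length() - h.bit_length()
            q ^= 1 << shift
            f ^= h << shift
        return q, f

    f = 0b111000110          # X + X^2 + X^6 + X^7 + X^8
    h = 0b10111              # 1 + X + X^2 + X^4
    q, r = poly_divmod(f, h)
    print(bin(q), bin(r))    # 0b11000 (X^3 + X^4) and 0b1110 (X + X^2 + X^3)
    assert poly_mul(q, h) ^ r == f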
Definition 2.5.3 Two polynomials, f_1(X) and f_2(X), are called equivalent mod h(X), written f_1(X) = f_2(X) mod h(X), if their remainders after division by h(X) coincide. That is,

    f_i(X) = g_i(X)h(X) + r(X),   i = 1, 2.        (2.5.3)

If, in addition, p_1(X) = p_2(X) mod h(X), with p_i(X) = q_i(X)h(X) + s(X), then the sums f_i(X) + p_i(X) and the products f_i(X)p_i(X), i = 1, 2, are also equivalent mod h(X).        (2.5.4)

Proof
We have, for i = 1, 2,
fi (X) = gi (X)h(X) + r(X),
with
deg r(X), deg s(X) < deg h(X).
Hence
fi (X) + pi (X) = (gi (X) + qi (X))h(X) + (r(X) + s(X))
with
deg(r(X) + s(X)) max[r(X), s(X)] < deg h(X).
Thus
f1 (X) + p1 (X) = f2 (X) + p2 (X) mod h(X).
Furthermore, for i = 1, 2, the product fi (X)pi (X) is represented as
gi (X)qi (X)h(X) + r(X)qi (X) + s(X)gi (X) h(X) + r(X)s(X).
Hence, the remainder for both polynomials f1 (X)p1 (X) and f2 (X)p2 (X) may come
only from r(X)s(X). Thus it is the same for both of them.
Note that every linear binary code X_N corresponds to a set of polynomials, with coefficients 0, 1, of degree ≤ N − 1, which is closed under addition mod 2:

    a(X) = a_0 + a_1X + ··· + a_{N−1}X^{N−1}  ↔  a^(N) = a_0 . . . a_{N−1},
    b(X) = b_0 + b_1X + ··· + b_{N−1}X^{N−1}  ↔  b^(N) = b_0 . . . b_{N−1},
    a(X) + b(X)  ↔  a^(N) + b^(N) = (a_0 + b_0) . . . (a_{N−1} + b_{N−1}).        (2.5.5)
    π^{−1}a(X) = (1/X)(a(X) + a_0) + a_0X^{N−1}.
Theorem 2.5.9 A binary cyclic code X contains, with each pair of polynomials a(X) and b(X), the sum a(X) + b(X) and any polynomial v(X)a(X) mod (1 + X^N).

Proof By linearity the sum a(X) + b(X) ∈ X. If v(X) = v_0 + v_1X + ··· + v_{N−1}X^{N−1} then each polynomial X^ka(X) mod (1 + X^N) corresponds to π^k a and hence belongs to X. As

    v(X)a(X) mod (1 + X^N) = ∑_{i=0}^{N−1} v_i X^i a(X) mod (1 + X^N),        (2.5.6)

the claim follows by linearity.

The polynomials of degree < N, with the multiplication a(X) ⊙ b(X) = a(X)b(X) mod (1 + X^N) and the usual F_2[X]-addition, form a commutative ring, denoted by F_2[X]/(1 + X^N). The binary cyclic codes are precisely the ideals of this ring.
Theorem 2.5.10
Proof
g(X)v(X) = vi X i g(X),
i=1
r < k.
Hence, each polynomial a(X) ∈ X is a linear combination of the polynomials g(X), Xg(X), . . . , X^{k−1}g(X) (all of which belong to X). On the other hand, the polynomials g(X), Xg(X), . . . , X^{k−1}g(X) have distinct degrees and hence are linearly independent. Therefore the words g, πg, . . . , π^{k−1}g, corresponding to g(X), Xg(X), . . . , X^{k−1}g(X), form a basis in X.
(iv) We know that each polynomial a(X) ∈ X has degree ≥ deg g(X). By the division algorithm,

    a(X) = v(X)g(X) + r(X).

Here, we must have deg v(X) < k and deg r(X) < deg g(X). But then v(X)g(X) belongs to X owing to Theorem 2.5.9 (as v(X)g(X) has degree ≤ N − 1, it coincides with v(X)g(X) mod (1 + X^N)). Hence,

    r(X) = a(X) + v(X)g(X) ∈ X

by linearity. As g(X) is the unique polynomial from X of minimum degree, r(X) = 0.
Corollary 2.5.11 Every binary cyclic code is obtained from the codeword corresponding to a polynomial of minimum degree, by cyclic shifts and linear combinations.
Definition 2.5.12 A polynomial g(X) of minimal degree in X is called a minimal degree generator of a (cyclic) binary code X , or briefly a generator of X .
Remark 2.5.13 There may be other polynomials that generate X in the sense
of Corollary 2.5.11. But the minimum degree polynomial is unique.
Theorem 2.5.14 A polynomial g(X) of degree ≤ N − 1 is the generator of a binary cyclic code of length N iff g(X) divides 1 + X^N, that is,

    1 + X^N = h(X)g(X)

for some polynomial h(X).

Proof (The only if part.) Let g(X) generate a cyclic code X. By the division algorithm, 1 + X^N = h(X)g(X) + r(X) with deg r(X) < deg g(X). That is,

    r(X) = h(X)g(X) + 1 + X^N,  i.e.  r(X) = h(X)g(X) mod (1 + X^N).        (2.5.7)

By Theorem 2.5.10, r(X) belongs to the cyclic code X generated by g(X). But g(X) must be the unique polynomial of minimum degree in X. Hence, r(X) = 0 and 1 + X^N = h(X)g(X).
(The if part.) Suppose that 1 + X^N = h(X)g(X), deg h(X) = N − deg g(X). Consider the set {a(X) : a(X) = u(X)g(X) mod (1 + X^N)}, i.e. the principal ideal, in the polynomial ring with the ⊙-multiplication, corresponding to g(X). This set forms a linear code; it contains g(X), Xg(X), . . . , X^{k−1}g(X) where k = deg h(X). It suffices to prove that X^kg(X) also belongs to the set. But X^kg(X) = (1 + X^N) + (X^k + h(X))g(X), and the polynomial X^k + h(X) has degree < k; hence X^kg(X) mod (1 + X^N) is a linear combination of g(X), Xg(X), . . . , X^{k−1}g(X).
    G^H_cycl = ( 1 1 0 1 0 0 0
                 0 1 1 0 1 0 0
                 0 0 1 1 0 1 0
                 0 0 0 1 1 0 1 )
and the cyclic shift of the last row is again in the code:

    π(0 0 0 1 1 0 1) = (1 0 0 0 1 1 0)
                     = (1 1 0 1 0 0 0) + (0 1 1 0 1 0 0) + (0 0 1 1 0 1 0).

By Lemma 2.5.6, the code is cyclic. By Theorem 2.5.10(iii), the generating polynomial g(X) corresponds to the framed part of the matrix G^H_cycl:

    1101  ↔  g(X) = 1 + X + X^3 = the generator.

But a similar argument can be used to show that an equivalent cyclic code is obtained from the word 1011 ↔ 1 + X^2 + X^3. There is no contradiction: it was not claimed that the polynomial ideal of a cyclic code is the principal ideal of a unique element.
If we choose a different order of the columns in the parity-check matrix, the code will be equivalent to the original code; that is, the code with the generator polynomial 1 + X^2 + X^3 is again a Hamming [7, 4] code.
In Problem 2.3 we will check that the Golay [23, 12] code is generated by the polynomial g(X) = 1 + X + X^5 + X^6 + X^7 + X^9 + X^11.
Worked Example 2.5.18 Using the factorisation

    X^7 + 1 = (X + 1)(X^3 + X + 1)(X^3 + X^2 + 1)        (2.5.8)

in F_2[X], find all cyclic binary codes of length 7. Identify those which are Hamming codes and their duals.

Solution See the table below.
    code X          generator of X                          generator of X⊥
    {0, 1}^7        1                                       1 + X^7
    parity-check    1 + X                                   1 + X + X^2 + X^3 + X^4 + X^5 + X^6
    Hamming         1 + X + X^3                             1 + X^2 + X^3 + X^4
    Hamming         1 + X^2 + X^3                           1 + X + X^2 + X^4
    dual Hamming    1 + X^2 + X^3 + X^4                     1 + X + X^3
    dual Hamming    1 + X + X^2 + X^4                       1 + X^2 + X^3
    repetition      1 + X + X^2 + X^3 + X^4 + X^5 + X^6     1 + X
    zero            1 + X^7                                 1
It is easy to check that all factors in (2.5.8) are irreducible. Any irreducible factor can either be included or not included in the decomposition of the generator polynomial. This argument proves that there exist exactly 8 binary cyclic codes in H_{7,2}, as demonstrated in the table.
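The enumeration can be replayed mechanically. Our Python sketch multiplies all subsets of the three irreducible factors of X^7 + 1 (polynomials as bit-masks) and prints the resulting generators and ranks:

    from itertools import combinations

    def poly_mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            b >>= 1
        return r

    factors = {0b11: "1+X", 0b1011: "1+X+X^3", 0b1101: "1+X^2+X^3"}
    assert poly_mul(poly_mul(0b11, 0b1011), 0b1101) == (1 << 7) | 1   # X^7 + 1

    for k in range(len(factors) + 1):
        for subset in combinations(factors, k):
            g = 1
            for f in subset:
                g = poly_mul(g, f)
            rank = 7 - (g.bit_length() - 1)
            names = [factors[f] for f in subset] or ["1"]
            print(f"g = {' * '.join(names):35s} rank {rank}")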
Example 2.5.19 (a) Polynomials of the first degree, 1 + X and X, are irreducible (but X does not appear in the decomposition of 1 + X^N). There is one irreducible binary polynomial of degree 2: 1 + X + X^2; two of degree 3: 1 + X + X^3 and 1 + X^2 + X^3; and three of degree 4:

    1 + X + X^4,   1 + X^3 + X^4   and   1 + X + X^2 + X^3 + X^4,        (2.5.9)
    1 + X^4 + X^6 + X^7 + X^8   and   1 + X^2 + X^6 + X^8,        (2.5.10)

    1 + X^3,   1 + X^5,   1 + X^{11},   1 + X^{13}.
and
1 + X 25 : (1 + X + X 2 + X 3 + X 4 )(1 + X 5 + X 10 + X 15 + X 20 ).
For N even, 1 + X N can have multiple roots (see Example 2.5.35(c)).
Example 2.5.20 Irreducible polynomials of degree 2 and 3 over the field F3
(that is, from F3 [X]) are as follows. There exist three irreducible polynomials of
degree 2 over F3 : X 2 + 1, X 2 + X + 2 and X 2 + 2X + 2. There exist eight irreducible
polynomials of degree 3 over F3 : X 3 + 2X + 2, X 3 + X 2 + 2, X 3 + X 2 + X + 2, X 3 +
2X 2 + 2X + 2, X 3 + 2X + 1, X 3 + X 2 + 2X + 1, X 3 + 2X 2 + 1 and X 3 + 2X 2 + X + 1.
Cyclic codes admit encoding and decoding procedures in terms of the polynomials. It is convenient to have a generating matrix of a cyclic code X in a form similar to G^H_cycl for the Hamming [7, 4] code (see above). That is, we want to find the basis in X which gives the following picture in the corresponding generating matrix:

    G_cycl = ( g_0 g_1 . . . g_{N−k}  0  . . .  0
               0  g_0 g_1 . . . g_{N−k}  . . .  0
               ⋮                                ⋮
               0  . . .  0  g_0 g_1 . . . g_{N−k} ),        (2.5.11)

i.e. the rows correspond to the polynomials

    G_cycl = ( g(X)
               Xg(X)
               ⋮
               X^{k−1}g(X) ).        (2.5.12)

The code has rank k and may be used for encoding words of length k as follows. Given a word a = a_0 . . . a_{k−1}, we form the polynomial a(X) = ∑_{0≤i<k} a_iX^i and take
We see that the cyclic codes are naturally labelled by their generator polynomials.
Definition 2.5.23 Let X be the cyclic binary code of length N generated by
g(X). The check polynomial h(X) of X is defined as the ratio (1 + X N )/g(X).
That is, h(X) is a unique polynomial for which h(X)g(X) = 1 + X N .
We will use the standard notation gcd(f(X), g(X)) for the greatest common divisor of the polynomials f(X) and g(X), and lcm(f(X), g(X)) for their least common multiple. Denote by X_1 + X_2 the direct sum of two linear codes X_1, X_2 ⊆ H_{N,2}. That is, X_1 + X_2 consists of the linear combinations λ_1a^(1) + λ_2a^(2) where λ_1, λ_2 = 0, 1 and a^(i) ∈ X_i, i = 1, 2. Compare Example 2.1.8(vii).
Worked Example 2.5.24 Let X_1 and X_2 be two binary cyclic codes of length N, with generators g_1(X) and g_2(X). Prove that:
(a) X_1 ⊆ X_2 iff g_2(X) divides g_1(X);
(b) the intersection X_1 ∩ X_2 is a cyclic code generated by lcm(g_1(X), g_2(X));
(c) the sum X_1 + X_2 is a cyclic code generated by gcd(g_1(X), g_2(X)).
lcm(g1 (X), g2 (X)), the code generated by lcm(g1 (X), g2 (X)) must be strictly larger
than X1 X2 . This contradicts the definition of X1 X2 .
(c) Similarly, X1 + X2 is the minimal linear code containing both X1 and X2 .
Hence, its generator divides both g1 (X) and g2 (X), i.e. is their common divisor.
And if it is not equal to the gcd(g1 (X), g2 (X)) then it contradicts the above minimality property.
Worked Example 2.5.25 Let X be a binary cyclic code of length N with the
generator g(X) and the check polynomial h(X). Prove that a(X) X iff the polynomial (1 + X N ) divides a(X)h(X), i.e. a h(X) = 0 in F2 [X]/(1 + X N ).
Solution If a(X) X then a(X) = f (X)g(X) for some polynomial f (X)
F2 [X]/(1 + X N ). Then
a(X)h(X) = f (X)g(X)h(X) = f (X)(1 + X N )
which equals 0 in F2 [X]/(1 + X N ). Conversely, let a(X) F2 [X]/(1 + X N ) and
assume that a(X)h(X) = 0 mod (1 + X N ). Write a(X) = f (X)g(X) + r(X) where
deg r(X) <deg g(X). Then
a(X)h(X) = f (X)(1 + X N ) + r(X)h(X) = r(X)h(X) mod (1 + X N ).
Hence, r(X)h(X) = 0 mod (1 + X N ) which is only possible when r(X) = 0 (since
deg r(X)h(X) < N). Thus, a(X) = f (X)g(X) and a(X) X .
Worked Example 2.5.26 Prove that the dual of a binary cyclic code X is again cyclic, and find its generating matrix.
Solution If y X , the dual code, then the dot-product " x y# = 0 for all x X .
But " x y# = "x y#, i.e. y X , which means that X is cyclic.
Let g(X) = g0 + g1 X + + gNk X Nk be the generating polynomial for X ,
where N k = d is the degree of g(X) and k gives the rank of X . We know that
the generating matrix G of X may be written as

    G = ( g(X)
          Xg(X)
          ⋮
          X^{k−1}g(X) ).        (2.5.13)
'
g j hi j
j=0
= 1,
i = 0, N,
= 0,
1 i < N.
" j g j h # = 0 for j = 0, 1, . . . , N k 1, j = 0, . . . , k 1,
where h = hk hk1 . . . h0 . It is then easy to see that h gives rise to the generator
h (X) of X .
An alternative solution is based on Worked Example 2.5.25. We know that
a(X) X iff a h(X) = 0. Let k be the degree of g(X) then the degree of h(X)
equals N k. The degree deg[a(X)h(X)] is < 2N k, so the coefficients of X Nk ,
X Nk+1 , . . . , X N1 in a(X)h(X) all vanish. That is:
a0 hNk + a1 hNk1 + + aNk h0
a1 hNk + a2 hNk1 + + aNk+1 h0
..
.
= 0,
= 0,
..
.
Hence the matrix

    H = ( h*(X)
          Xh*(X)
          ⋮
          X^{N−k−1}h*(X) ),        (2.5.14)

where h*(X) = X^{N−k}h(X^{−1}), with the coefficient string h̄ = h_k h_{k−1} . . . h_0.
We conclude that the matrix H generates a code X′ ⊆ X⊥. But since the leading coefficient of h(X) equals 1, the rank of X′ equals N − k. Hence, X′ = X⊥.
It remains to check that polynomial h (X) divides 1 + X N . To this end,
we deduce from g(X)h(X) = 1 + X N that h(X 1 )g(X 1 ) = X N + 1. Hence
h (X)X k g(X 1 ) = 1 + X N , and as X k g(X 1 ) equals the polynomial gk + gk1 X +
+ g0 X k , the required fact follows.
2
Worked Example 2.5.27 Let X be a binary cyclic code of length N with generator g(X).
(a) Show that the set of codewords a X of even weight is a cyclic code and find
its generator.
(b) Show that X contains a codeword of odd weight iff g(1) = 0 or, equivalently,
word 1 X .
Solution (a) If the code X is even (i.e. contains only words of even weight) then every polynomial a(X) ∈ X has a(1) = ∑_{0≤i<N} a_i = 0. Hence, a(X) contains a factor (X + 1). Therefore, the generator g(X) has a factor (X + 1). The converse is also true: if (X + 1) divides g(X), or, equivalently, g(1) = 0, then every codeword a ∈ X is of even weight.
Now assume that X contains a word with an odd weight, i.e. g(1) = 1; that
is, (1 + X) does not divide g(X). Let X ev be the subcode in X formed by the
even codewords. A cyclic shift does not change the weight, so X ev is a cyclic
code. For the corresponding polynomials a(X) we have, as before, that (1 + X)
divides a(X). Thus, the generator gev (X) of X ev is divisible by (1 + X), hence
gev (X) = g(X)(X + 1).
(b) It remains to show that g(1) = 1 iff the word 1 ∈ X. The corresponding polynomial is 1 + ··· + X^{N−1}, the complementary factor to (1 + X) in the decomposition 1 + X^N = (1 + X)(1 + ··· + X^{N−1}). So, if g(1) = 1, i.e. g(X) does not contain the factor (1 + X), then g(X) must be a divisor of 1 + ··· + X^{N−1}. This implies that 1 ∈ X. The converse statement is established in a similar manner.
Worked Example 2.5.28 Let X be a binary cyclic code of length N with generator g(X) and check polynomial h(X).
(a) Prove that X is self-orthogonal iff h*(X) divides g(X), and self-dual iff h*(X) = g(X), where h*(X) = h_k + h_{k−1}X + ··· + h_0X^k and h(X) = h_0 + ··· + h_{k−1}X^{k−1} + h_kX^k is the check polynomial, with g(X)h(X) = 1 + X^N.
(b) Let r be a divisor of N : r|N . A binary code X is called r-degenerate if every
codeword a X is a concatenation c . . . c where c is a string of length r. Prove that
X is r-degenerate iff h(X) divides (1 + X r ).
Solution (a) Self-orthogonality means that X X , i.e. "a b# = 0 for all
a, b X . From Worked Example 2.5.26 we know that h (X) gives the generator polynomial of X . Then, by virtue of Worked Example 2.5.26, X X iff
h (X) divides g(X).
Self-duality means that X = X , that is h (X) = g(X).
(b) For N = rs, we have the decomposition
1 + X N = (1 + X r )(1 + X r + + X r(s1) ).
Now assume cyclic code X of length N with generator g(X) is r-degenerate. Then
the word g is of the form 1c1c . . . 1c for some string c of length r 1 (with c = 1c).
Let c(X) be the polynomial corresponding to c (of degree r 2). Then g(X) is
given by
1 + X c(X) + X r + X r+1 c(X) + + X r(s1) + X r(s1)+1 c(X)
= (1 + X r + + X r(s1) )[1 + X c(X)].
For the check polynomial h(X) we obtain

    h(X) = (1 + X^N) / [(1 + X^r + ··· + X^{r(s−1)})(1 + Xc(X))]
         = (1 + X^r) / (1 + Xc(X)),

i.e. h(X) is a divisor of (1 + X^r).
Conversely, let h(X)|(1 + X r ), with h(X)g(X) = 1 + X r where g(X) =
c j X j , with c0 = 1. Take c = c0 . . . cr1 ; repeating the above argument in
0 jr1
the reverse order, we conclude that the word g is the concatenation c . . . c. Then the
cyclic shift g is the concatenation c(1) . . . c(1) where c(1) = cr1 c0 . . . cr2 (= c,
the cyclic shift of c in {0, 1}r ). Similarly, for subsequent cyclic shift iterations
2 g, . . .. Hence, the basis vectors in X are r-degenerate, and so is the whole of X .
In the standard arithmetic, a (real or complex) polynomial p(X) of a given degree d is conveniently identified through its roots (or zeros) α_1, . . . , α_d (in general, complex), by means of the monomial decomposition p(X) = p_d ∏_{1≤i≤d}(X − α_i). In the binary arithmetic (and, more generally, the q-ary arithmetic), the roots of polynomials are still an extremely useful concept. In our situation, the roots help to construct the generator polynomial g(X) = ∑_{0≤i≤d} g_iX^i of a binary cyclic code with important predicted properties. Assume for the moment that the roots α_1, . . . , α_d of g(X) are a well-defined object, and the representation

    g(X) = ∏_{1≤i≤d}(X − α_i)

has a consistent meaning (which is provided within the framework of finite fields). Even without knowing the formal theory, we are able to make a couple of helpful observations.
Even without knowing the formal theory, we are able to make a couple of helpful
observations.
The first observation is that the α_i are Nth roots of unity, as they should be among the zeros of the polynomial 1 + X^N. Hence, they can be multiplied and inverted, i.e. they form an Abelian multiplicative group of size N, perhaps cyclic. Second, in the binary arithmetic, if α is a zero of g(X) then so is α^2, as g(X)^2 = g(X^2). Then α^4 is also a zero, as well as α^8, and so on. We conclude that the sequence α, α^2, . . . begins cycling: α^{2^d} = α (or α^{2^d−1} = 1) where d is the degree of g(X). That is, all Nth roots of unity split into disjoint classes, of the form C = {α, α^2, . . . , α^{2^{c−1}}}, of size c, where c = c(C) is a positive integer (with 2^c − 1 dividing N). The notation C(α) is instructive, with c = c(α). The members of the same class are said to be conjugate to each other. If we want a generating polynomial with root α then all conjugate roots of unity from C(α) will also be among the roots of g(X).
Thus, to form a generator g(X) we have to borrow roots from classes C and
enlist, with each borrowed root of unity, all members of their classes. Then, since
any polynomial a(X) from the cyclic code generated by g(X) is a multiple of g(X)
(see Theorem 2.5.10(iv)), the roots of g(X) will be among the roots of a(X). Conversely, if a(X) has roots i of g(X) among its roots then a(X) is in the code. We
see that cyclic codes are conveniently described in terms of roots of unity.
Example 2.5.29 (The Hamming [7, 4] code) Recall that the parity-check matrix H for the binary Hamming [7, 4] code X^H is 3 × 7; it lists as its columns all non-zero binary words of length 3: different orderings of these columns define equivalent codes. Later in this section we explain that the sequence of non-zero binary words of any given length 2^ℓ − 1, written in some particular order (or orders), can be interpreted as a sequence of powers of a single element ω: ω^0, ω, ω^2, . . . , ω^{2^ℓ−2}. The multiplication rule generating these powers is of a special type (multiplication of polynomials modulo a particular irreducible polynomial of degree ℓ). To stress this fact, we use in this section the notation ⊙ for this multiplication rule, writing ω^{⊙i} in place of ω^i. Anyway, for ℓ = 3, one appropriate order of the binary non-zero 3-words (out of the two possible orders) is

    H = ( 0 0 1 0 1 1 1
          0 1 0 1 1 1 0     ≅  ( ω^0 ω ω^{⊙2} ω^{⊙3} ω^{⊙4} ω^{⊙5} ω^{⊙6} ).
          1 0 0 1 0 1 1 )

Then, with this interpretation, the equation aH^T = 0, determining that the word a = a_0 . . . a_6 (or its polynomial a(X) = ∑_{0≤i<7} a_iX^i) lies in X^H, can be rewritten as

    ∑_{0≤i<7} a_i ω^{⊙i} = 0,   or   a(ω) = 0.

In other words, a(X) ∈ X^H iff ω is a root of a(X) under the multiplication rule ⊙ (which in this case is multiplication of binary polynomials of degree ≤ 2 modulo the polynomial 1 + X + X^3).
The last statement can be rephrased in this way: the Hamming [7, 4] code is equivalent to the cyclic code with the generator g(X) that has ω among its roots; in this case the generator is g(X) = 1 + X + X^3, with g(ω) = ω^0 + ω + ω^{⊙3} = 0. The alternative ordering of the columns of H^H is related in the same fashion to the polynomial 1 + X^2 + X^3.
We see that the Hamming [7, 4] code is defined by a single root ω, provided that we establish proper terms of operation with its powers. For that reason we can call ω the defining root (or defining zero) for this code. There are reasons to call the element ω primitive; cf. Sections 3.1–3.3.
Worked Example 2.5.30 A code X is called reversible if a = a_0a_1 . . . a_{N−1} ∈ X implies that a′ = a_{N−1} . . . a_1a_0 ∈ X. Prove that a cyclic code with generator g(X) is reversible iff g(α) = 0 implies g(α^{−1}) = 0.

Solution For the generator polynomial g(X) = ∑_{0≤i≤d} g_iX^i, with deg g(X) = d < N,
(2.5.15)
This implies that either g(X) | f(X) or g(X) | (h_1(X) + h_2(X)). We conclude that if the polynomial g(X) is irreducible, (2.5.15) is impossible, unless h_1(X) = h_2(X) and v(X) = 0. For one and only one polynomial h(X) we have

    f(X)h(X) = 1 mod g(X);

h(X) represents the inverse for f(X) in multiplication mod g(X). We write h(X) = f(X)^{−1}.
On the other hand, if g(X) is reducible, then g(X) = b(X)b′(X) where both b(X) and b′(X) are non-zero and have degree < d. That is, b(X)b′(X) = 0 mod g(X). If the multiplication mod g(X) led to a field, both b(X) and b′(X) would have inverses, b(X)^{−1} and b′(X)^{−1}. But then

    b(X)^{−1} ⊙ b(X) ⊙ b′(X) = b′(X) = 0,

and similarly b(X) = 0.
A field obtained via the above construction is called a polynomial field and is
often denoted by F2 [X]/"g(X)#. It contains 2d elements where d = deg g(X) (representing polynomials of degree < d). We will call g(X) the core polynomial of the
field. For the rest of this section we denote the multiplication in a given polynomial
field by . The zero polynomial and the unit polynomial are denoted correspondingly, by 0 and 1: they are obviously the zero and the unity of the polynomial field.
A key role is played by the following result.
Theorem 2.5.33 (a) The multiplicative group of non-zero elements in the polynomial field F_2[X]/⟨g(X)⟩ is isomorphic to the cyclic group Z_{2^d−1} of size 2^d − 1.
(b) The polynomial fields obtained by picking different irreducible polynomials of degree d are all isomorphic.

Proof We will only prove here assertion (a); assertion (b) will be established in Section 3.1. Take any element from the field, a(X) ∈ F_2[X]/⟨g(X)⟩, and observe that

    a^{⊙i}(X) := a(X) ⊙ ··· ⊙ a(X)   (i times)

(the multiplication in the field) takes at most 2^d − 1 values (the number of elements in the field less one, as the zero 0 is excluded). Hence there exists a positive integer r such that a^{⊙r}(X) = 1; the smallest such value of r is called the order of a(X).
Let r be the maximal order among the non-zero elements, attained at an element a(X), and let b(X) be any non-zero element, of order s; we check that s divides r. For a prime p, write

    s = p^{c′}l′   and   r = p^{c}l,

with integers c′, c ≥ 0 and l, l′ ≥ 1, where l and l′ are not divisible by p. We want to show that c′ ≤ c. Indeed, the element a^{⊙p^c}(X) has order l, b^{⊙l′}(X) has order p^{c′}, and the product a^{⊙p^c} ⊙ b^{⊙l′}(X) has order lp^{c′}. Hence, c′ ≤ c, or else r would not be maximal. This is true for any prime p, hence s divides r.
Thus, with r being the maximal order, every non-zero element b(X) in the field obeys b^{⊙r}(X) = 1. By using the pigeon-hole principle, we conclude that r = 2^d − 1, the number of non-zero elements of the field. Hence, with a(X) being an element of order r, the powers 1, a(X), . . . , a^{⊙(2^d−1)}(X) exhaust the multiplicative group of the field.
In the wake of Theorem 2.5.33, we can use the notation F_{2^d} for any polynomial field F_2[X]/⟨g(X)⟩ where g(X) is an irreducible binary polynomial of degree d. Further, the multiplicative group of non-zero elements in F_{2^d} is denoted by F*_{2^d}; it is cyclic (≅ Z_{2^d−1}, according to Theorem 2.5.33). Any generator of the group F*_{2^d} (whose ⊙-powers exhaust F*_{2^d}) is called a primitive element of the field F_{2^d}.
Example 2.5.34 We can see the importance of writing down the full list of irreducible polynomials. There are six irreducible binary polynomials of degree 5
(each of which is primitive):
1 + X 2 + X 5, 1 + X 3 + X 5, 1 + X + X 2 + X 3 + X 5,
1 + X + X 2 + X 4 + X 5, 1 + X + X 3 + X 4 + X 5,
1 + X2 + X3 + X4 + X5
(2.5.16)
(2.5.17)
The number of irreducible polynomials grows significantly with the degree: there
are 18 of degree 7, 30 of degree 8, and so on. However, there exist and are available
quite extensive tables of irreducible polynomials over various finite fields.
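The counts quoted in this example are easy to reproduce by brute force. Our Python sketch treats polynomials as bit-masks and tests irreducibility by trial division with all divisors of degree at most half the degree:

    def poly_mod(f, h):
        while f and f.bit_length() >= h.bit_length():
            f ^= h << (f.bit_length() - h.bit_length())
        return f

    def is_irreducible(p):
        d = p.bit_length() - 1
        for q in range(2, 1 << (d // 2 + 1)):    # candidate divisors, deg >= 1
            if poly_mod(p, q) == 0:
                return False
        return True

    for d in (2, 3, 4, 5, 7, 8):
        count = sum(is_irreducible((1 << d) | p) for p in range(1 << d))
        print(d, count)    # expected: 1, 2, 3, 6, 18, 30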
    Core polynomial 1 + X + X^3              Core polynomial 1 + X^2 + X^3
    power    polynomial     word             power    polynomial     word
    --       0              000              --       0              000
    X^⊙0     1              100              X^⊙0     1              100
    X        X              010              X        X              010
    X^⊙2     X^2            001              X^⊙2     X^2            001
    X^⊙3     1 + X          110              X^⊙3     1 + X^2        101
    X^⊙4     X + X^2        011              X^⊙4     1 + X + X^2    111
    X^⊙5     1 + X + X^2    111              X^⊙5     1 + X          110
    X^⊙6     1 + X^2        101              X^⊙6     X + X^2        011

                                Figure 2.6
    power     polynomial               coefficient string
    --        0                        0000
    X^⊙0      1                        1000
    X         X                        0100
    X^⊙2      X^2                      0010
    X^⊙3      X^3                      0001
    X^⊙4      1 + X                    1100
    X^⊙5      X + X^2                  0110
    X^⊙6      X^2 + X^3                0011
    X^⊙7      1 + X + X^3              1101
    X^⊙8      1 + X^2                  1010
    X^⊙9      X + X^3                  0101
    X^⊙10     1 + X + X^2              1110
    X^⊙11     X + X^2 + X^3            0111
    X^⊙12     1 + X + X^2 + X^3        1111
    X^⊙13     1 + X^2 + X^3            1011
    X^⊙14     1 + X^3                  1001

                                Figure 2.7
(c) The field F_2[X]/⟨1 + X + X^4⟩ contains 16 elements. The field table is given in Figure 2.7. In this case, the multiplicative group is ≅ Z_{15}, and the field can be denoted by F_16. As above, the element X ∈ F_2[X]/⟨1 + X + X^4⟩ yields a root of the polynomial 1 + X + X^4; the other roots are X^{⊙2}, X^{⊙4} and X^{⊙8}.
This example can be used to identify the Hamming [15, 11] code as (an equivalent to) the cyclic code with generator g(X) = 1 + X + X^4. We can now say that the Hamming [15, 11] code is (modulo equivalence) the cyclic code of length 15 with the defining root ω (= X) in the field F_2[X]/⟨1 + X + X^4⟩. As X is a generator of the multiplicative group of the field, we again could say that the defining root is a primitive element in F_16.
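The table in Figure 2.7 can be regenerated by repeatedly multiplying by X modulo the core polynomial; a short Python sketch of ours (elements as bit-masks, bit i = coefficient of X^i):

    CORE = 0b10011            # 1 + X + X^4
    d = 4

    def times_x(a):
        a <<= 1
        if a >> d:            # reduce modulo the core polynomial
            a ^= CORE
        return a

    a = 1                     # X^0
    for i in range(2**d - 1):
        word = "".join(str((a >> j) & 1) for j in range(d))
        print(f"X^{i:<2d} -> {word}")
        a = times_x(a)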
In general, take the field F_2[X]/⟨g(X)⟩, where g(X) = ∑_{0≤i≤d} g_iX^i is an irreducible binary polynomial of degree d. Then the powers X, X^{⊙2}, X^{⊙4}, . . . , X^{⊙2^{d−1}} will satisfy the equation g(·) = 0, i.e. they are the roots of the core polynomial. However, the ⊙-powers of X need not exhaust the whole multiplicative group in general: it only happens when g(X) is a primitive binary polynomial; for the detailed discussion of this property see Sections 3.1–3.3. For a primitive core polynomial g(X) we have, in addition, that the powers X^{⊙i} for i < d = deg g(X) coincide with X^i, while the further powers X^{⊙i}, d ≤ i ≤ 2^d − 1, are relatively easy to calculate.
With this in mind, we can pass to a general binary Hamming code.
Example 2.5.36 Let X^H be the binary Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code. We know that its parity-check matrix H features all non-zero column-vectors of length ℓ. These vectors, written in a particular order, list the consecutive powers ω^{⊙i}, i = 0, 1, . . . , 2^ℓ − 2, in the field F_2[X]/⟨g(X)⟩ where ω = X and g(X) = g_0 + g_1X + ··· + g_{ℓ−1}X^{ℓ−1} + X^ℓ is a primitive polynomial of degree ℓ. Thus,

    H = ( 1 0 . . . 0   g_0       . . .
          0 1 . . . 0   g_1       . . .
          ⋮ ⋮       ⋮   ⋮
          0 0 . . . 1   g_{ℓ−1}   . . . ),        (2.5.18)

or H ≅ ( ω^{⊙0}  ω^{⊙1}  . . .  ω^{⊙(2^ℓ−2)} ).
Hence, as before, the equation aH^T = 0 for a codeword is equivalent to the equation a(ω) = 0 for the corresponding polynomial. So, we can say that a(X) ∈ X^H iff ω is among the roots of a(X).
On the other hand, by construction, ω is a root of g(X): g(ω) = 0. Thus, we identify the Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code as equivalent to the cyclic code of length 2^ℓ − 1 with the generator polynomial g(X) (with the defining root ω). The role of ω can be played by any conjugate element from {ω, ω^2, . . . , ω^{2^{ℓ−1}}}.
The above idea leads to an immediate (and far-reaching) generalisation. Take N = 2^ℓ − 1 and let ω be a primitive element of the field F_{2^ℓ} ≅ F_2[X]/⟨g(X)⟩ where g(X) is a primitive polynomial. (In all the examples and problems from this chapter, this requirement is fulfilled.) Consider a defining set of roots: to start with, of the form ω, ω^2, ω^3, but more generally, ω, ω^2, . . . , ω^{δ−1}. (Using the parameter δ, an integer ≥ 3, is a tradition here.) Consider the cyclic code with these roots: what can we say about it? With the length N = 2^ℓ − 1, we can guess that it will yield a subcode of the Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ] code, and that it may correct more than a single error. This is the gist of the so-called (binary) BCH code construction (Bose–Chaudhuri, Hocquenghem, 1959).
In this section we restrict ourselves to a brief introduction to the BCH codes; in greater detail and generality these codes are discussed in Section 3.2. For N = 2^ℓ − 1 the field F_{2^ℓ} ≅ F_2[X]/⟨g(X)⟩ has the property that its non-zero elements are the Nth roots of unity (i.e. the zeros of the polynomial 1 + X^N). In other words,
    1 + X^N = ∏_{1≤j≤N} (X − α_j),

where the α_j list the whole of F*_{2^ℓ}. (In the terminology of Section 3.1, F_{2^ℓ} is the splitting field for 1 + X^N over F_2.) As before, we use the notation ω := X for the generator of the multiplicative cyclic group F*_{2^ℓ}. (In fact, it could be any generator of this group.)
Because ω^N = 1 and the power N is minimal with this property, the element ω is often called a primitive Nth root of unity. Consequently, the powers ω^k for 0 ≤ k < N yield distinct elements of the field. This fact is used below when we conclude that the product ∏_{1≤i<j≤δ−1}(ω^{k_j} − ω^{k_i}) ≠ 0. The binary BCH code of length N with designed distance δ is the cyclic code whose generator is

    g(X) = lcm(M_ω(X), M_{ω^2}(X), . . . , M_{ω^{δ−1}}(X)).        (2.5.20)

Here lcm stands for the least common multiple and M_φ(X) denotes the minimal binary polynomial with root φ. For brevity, we will use in this chapter the term binary BCH codes. (A more general class of BCH codes will be introduced in Section 3.2.)
Example 2.5.38 For N = 7, the appropriate polynomial field is F_2[X]/⟨1 + X + X^3⟩ or F_2[X]/⟨1 + X^2 + X^3⟩, i.e. one of the two realisations of the field F_8. Since 7 is a prime number, any non-zero, non-unit element of the field has multiplicative order 7, i.e. is a generator of the multiplicative group. In fact, we have the decomposition of the polynomial 1 + X^7 into irreducible factors:

    1 + X^7 = (1 + X)(1 + X + X^3)(1 + X^2 + X^3).

Further, if we choose the polynomial field F_2[X]/⟨1 + X + X^3⟩ then ω = X satisfies

    ω^3 = 1 + ω,   (ω^2)^3 = 1 + ω^2,   (ω^4)^3 = 1 + ω^4,

i.e. the conjugates ω, ω^2 and ω^4 are the roots of the core polynomial 1 + X + X^3:

    1 + X + X^3 = (X − ω)(X − ω^2)(X − ω^4).

Next, ω^3, ω^6 and ω^{12} = ω^5 are the roots of 1 + X^2 + X^3:

    1 + X^2 + X^3 = (X − ω^3)(X − ω^5)(X − ω^6).

Hence, the binary BCH code of length 7 with designed distance 3 is formed by the binary polynomials a(X) of degree ≤ 6 such that

    a(ω) = a(ω^2) = 0,   that is, a(X) is a multiple of 1 + X + X^3.

This code is equivalent to the Hamming [7, 4] code; in particular its true distance equals 3.
Next, the binary BCH code of length 7 with designed distance 4 is formed by the binary polynomials a(X) of degree ≤ 6 such that

    a(ω) = a(ω^2) = a(ω^3) = 0,   that is, a(X) is a multiple of
    (1 + X + X^3)(1 + X^2 + X^3) = 1 + X + X^2 + X^3 + X^4 + X^5 + X^6.

This is simply the repetition code R_7.
The staple of the theory of the BCH codes is

Theorem 2.5.39 (The BCH bound) The minimal distance of a binary BCH code with designed distance δ is at least δ.

The proof of Theorem 2.5.39 (sometimes referred to as the BCH theorem) is based on the following result.
Lemma 2.5.40 Consider the m × m Vandermonde determinant with entries from a commutative ring:

    det ( κ_1    κ_2    . . .  κ_m
          κ_1^2  κ_2^2  . . .  κ_m^2
          ⋮      ⋮              ⋮
          κ_1^m  κ_2^m  . . .  κ_m^m )
      = det ( κ_1  κ_1^2  . . .  κ_1^m
              κ_2  κ_2^2  . . .  κ_2^m
              ⋮    ⋮              ⋮
              κ_m  κ_m^2  . . .  κ_m^m )        (2.5.21)
      = ∏_{1≤l≤m} κ_l  ∏_{1≤i<j≤m} (κ_j − κ_i).        (2.5.22)
Proof of Theorem 2.5.39 Suppose a(X) = a_0 + a_1X + ··· + a_{N−1}X^{N−1} is a codeword of the BCH code X with designed distance δ, i.e. a(ω) = ··· = a(ω^{δ−1}) = 0. In matrix form,

    ( 1  ω        ω^2          . . .  ω^{N−1}
      1  ω^2      ω^4          . . .  ω^{2(N−1)}
      ⋮  ⋮        ⋮                    ⋮
      1  ω^{δ−1}  ω^{2(δ−1)}   . . .  ω^{(N−1)(δ−1)} )  ( a_0
                                                          a_1
                                                          ⋮
                                                          a_{N−1} )  = 0.

Due to Lemma 2.5.40, any (δ − 1) columns of this ((δ − 1) × N) matrix are linearly independent. Hence, there must be at least δ non-zero coefficients in a(X). Thus, the distance of X is ≥ δ.
Example 2.5.41 (Here a mistake in [18], p. 106, is corrected.) (a) Consider a BCH code with N = 15 and δ = 5. Use the following decomposition into irreducible polynomials:

    X^15 − 1 = (X + 1)(X^2 + X + 1)(X^4 + X + 1)(X^4 + X^3 + 1)(X^4 + X^3 + X^2 + X + 1).

The generator of the code is

    g(X) = (X^4 + X + 1)(X^4 + X^3 + X^2 + X + 1) = X^8 + X^7 + X^6 + X^4 + 1.

Indeed, g(ω^3) = g(ω^9) = 0. The set of zeros of X^4 + X^3 + X^2 + X + 1 is (ω^3, ω^6, ω^9, ω^{12}). The set of zeros of X^4 + X + 1 is (ω, ω^2, ω^4, ω^8). The set of zeros of X^4 + X^3 + 1 is (ω^7, ω^{14}, ω^{13}, ω^{11}). The set of zeros of X^2 + X + 1 is (ω^5, ω^{10}).
(b) Let N = 31 and ω be a primitive element of F_32. The minimal polynomial with root ω is

    M_ω(X) = (X − ω)(X − ω^2)(X − ω^4)(X − ω^8)(X − ω^{16}).

We find also the minimal polynomial for ω^5:

    M_{ω^5}(X) = (X − ω^5)(X − ω^{10})(X − ω^{20})(X − ω^9)(X − ω^{18}).

By definition, the generator of the BCH code of length 31 with designed distance δ = 8 is g(X) = lcm(M_ω(X), M_{ω^3}(X), M_{ω^5}(X), M_{ω^7}(X)). The minimal distance of this BCH code (which is, obviously, at least 9) is in fact at least 11. This follows from Theorem 2.5.39, because all the powers ω, ω^2, . . . , ω^{10} are listed among the roots of g(X).
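The computation in part (a) can be replayed with a short program. Our Python sketch realises F_16 as F_2[X]/⟨1 + X + X^4⟩ (field elements as 4-bit masks), forms minimal polynomials as products over conjugacy classes, and recovers the generator X^8 + X^7 + X^6 + X^4 + 1.

    CORE, DEG = 0b10011, 4

    def gf_mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            b >>= 1
            a <<= 1
            if a >> DEG:
                a ^= CORE
        return r

    def gf_pow(a, n):
        r = 1
        for _ in range(n):
            r = gf_mul(r, a)
        return r

    def min_poly(elem):
        """Minimal binary polynomial of elem = prod over conjugates of (X + c)."""
        conj, c = [], elem
        while c not in conj:
            conj.append(c)
            c = gf_mul(c, c)
        p = [1]                                   # ascending coefficients
        for c in conj:
            q = [0] * (len(p) + 1)
            for i, pi in enumerate(p):
                q[i] ^= gf_mul(c, pi)
                q[i + 1] ^= pi
            p = q
        return p                                  # entries are 0/1

    def poly_mul_f2(a, b):
        q = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                q[i + j] ^= ai & bj
        return q

    omega = 0b0010                                # omega = X
    m1 = min_poly(omega)                          # 1 + X + X^4
    m3 = min_poly(gf_pow(omega, 3))               # 1 + X + X^2 + X^3 + X^4
    g = poly_mul_f2(m1, m3)                       # lcm of M_w, M_w^2, M_w^3, M_w^4
    print(g)   # ascending coefficients of 1 + X^4 + X^6 + X^7 + X^8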
There exists a decoding procedure for a BCH code which is simple to implement: it generalises the Hamming code decoding procedure. In view of Theorem 2.5.39, the BCH code with designed distance δ corrects at least t = ⌊(δ − 1)/2⌋ errors. Suppose a codeword c = c_0 . . . c_{N−1} has been sent and corrupted to r = c + e where e = e_0 . . . e_{N−1}. Assume that e has at most t non-zero entries. Introduce the corresponding polynomials c(X), r(X) and e(X), all of degrees < N. For c(X) we have that c(ω) = c(ω^2) = ··· = c(ω^{δ−1}) = 0. Then, clearly,

    r(ω) = e(ω),  r(ω^2) = e(ω^2),  . . . ,  r(ω^{δ−1}) = e(ω^{δ−1}).        (2.5.23)

So, we calculate r(ω^i) for i = 1, . . . , δ − 1. If these are all 0, r(X) ∈ X (no error or at least t + 1 errors). Otherwise, let E = {i : e_i = 1} indicate the erroneous digits and assume that 0 < ♯E ≤ t. Introduce the error locator polynomial

    σ(X) = ∏_{i∈E} (1 − ω^i X),        (2.5.24)

with coefficients from F_{2^ℓ}, of degree ♯E and with the lowest coefficient 1. If we know σ(X), we can find which powers ω^{−i} are its roots and hence find the erroneous digits i ∈ E. We then simply change these digits and correct the errors.
In order to calculate σ(X), consider the formal power series

    ρ(X) = ∑_{j≥1} e(ω^j) X^j.

(Observe that, as ω^N = 1, the coefficients of this power series recur.) For the initial (δ − 1) coefficients we have the equalities, by virtue of (2.5.23),

    e(ω^j) = r(ω^j),   j = 1, . . . , δ − 1;

these are the only ones needed for our purpose, and they are calculated in terms of the received word r.
Now set

    ξ(X) = ∑_{i∈E} ω^i X ∏_{j∈E: j≠i} (1 − ω^j X).        (2.5.25)

Then

    ρ(X) = ∑_{j≥1} ∑_{i∈E} ω^{ij} X^j = ∑_{i∈E} ∑_{j≥1} (ω^i X)^j
         = ∑_{i∈E} ω^i X/(1 − ω^i X) = ξ(X)/σ(X).        (2.5.26)
Multiplying by σ(X), we obtain ξ(X) = σ(X)ρ(X); namely,

    (σ_0 + σ_1X + ··· + σ_tX^t)(r(ω)X + ··· + r(ω^{2t})X^{2t} + e(ω^{2t+1})X^{2t+1} + ···)
        = ξ_0 + ξ_1X + ··· + ξ_tX^t.        (2.5.27)

We are interested in the coefficients of X^k for t < k ≤ 2t: these satisfy

    ∑_{0≤j≤t} σ_j r(ω^{k−j}) = 0,        (2.5.28)

which does not involve any of the terms e(ω^l). We obtain the following equations:

    ( r(ω^{t+1})  r(ω^t)      . . .  r(ω)       ( σ_0
      r(ω^{t+2})  r(ω^{t+1})  . . .  r(ω^2)       σ_1
      ⋮           ⋮                  ⋮            ⋮
      r(ω^{2t})   r(ω^{2t−1}) . . .  r(ω^t) )     σ_t )  = 0.

The above matrix is t × (t + 1), so it always has a non-zero vector in the kernel; this vector identifies the error locator polynomial σ(X). We see that the above routine (called the Berlekamp–Massey decoding algorithm) enables us to specify the set E and hence correct ≤ t errors.
Unfortunately, the BCH codes are asymptotically bad: for any sequence of BCH codes of length N → ∞, either k/N → 0 or d/N → 0. In other words, they lie at the bottom of Figure 2.2. To obtain codes that meet the Gilbert–Varshamov (GV) bound, one needs more powerful methods, based on algebraic geometry. Such codes were constructed in the early 1970s (the Goppa and Justesen codes). It remains a problem to construct codes that lie above the Gilbert–Varshamov curve. As was mentioned on page 160, a new class of codes was invented in 1982 by Tsfasman, Vladut and Zink; these codes lie above the GV curve when the number of symbols in the code alphabet is large. However, for binary codes, the problem is still waiting for a solution.
Worked Example 2.5.42 Compute the rank and minimum distance of the cyclic code with generator polynomial g(X) = X^3 + X + 1 and parity-check polynomial h(X) = X^4 + X^2 + X + 1. Now let ω be a root of g(X) in the field F_8. We receive the word r(X) = X^5 + X^3 + X (mod X^7 − 1). Verify that r(ω) = ω^4, and hence decode r(X) using minimum-distance decoding.
Solution A cyclic code X of length N has generator polynomial g(X) F2 [X]
and parity-check polynomial h(X) F2 [X] with g(X)h(X) = 1 + X N . Recall that
if g(X) has degree k, i.e. g(X) = a0 + a1 X + + ak X k where ak = 0, then g(X),
241
Xg(X), . . . , X Nk1 g(X) form a basis for X . In particular, the rank of X equals
N k. In this example, N = 7, k = 3 and rank(X ) = 4.
If h(X) = b0 + b1 X + + bNk X Nk then the parity-check matrix H for X has
the form
...
b1
b0
0
... 0 0
bNk bNk1
0
b0
... 0 0
bNk bNk1 . . . b1
.
.
.
.
.
.
..
..
..
..
..
..
...
bNk bNk1 . . . b1 b0
;<
=
N
1 0 1 1 1 0 0
H = 0 1 0 1 1 1 0
0 0 1 0 1 1 1
and we have the following implications:
no zero column
no codewords of weight 1,
no repeated column no codewords of weight 2.
The minimum distance d(X ) of a linear code X is the minimum non-zero weight
of a codeword. In the example, d(X ) = 3. [In fact, X is equivalent to the Hamming [7, 4] code.]
>
Since g(X) F2 [X] is irreducible, the code X F2 [X] "X 7 1# is the cyclic
code defined by . The multiplicative cyclic group Z
7 of non-zero elements of
field F8 is
0 = 1, , 2 , 3 = + 1, 4 = 2 + ,
5 = 3 + 2 = 2 + + 1, 6 = 3 + 2 + = 2 + 1,
7 = 3 + = 1.
Next, the value r( ) is
r( ) = + 3 + 5
= + ( + 1) + ( 2 + + 1)
= 2 + = 4,
242
.
..
(k) divide rk1 (X) by rk1 (X):
rk2 (X) = qk (X)rk1 (X) + rk (X) where deg rk (X) < deg Rk1 (X),
...
The algorithm continues until the remainder is 0:
(s) divide rs2 (X) by rs1 (X):
rs2 (X) = qs (X)rs1 (X).
Then
gcd f (X), g(X) = rs1 (X).
(2.5.29)
At each stage, the equation for the current remainder rk (X) involves two previous
remainders. Hence, all remainders, including gcd( f (X), g(X)), can be written in
terms of f (X) and g(X). In fact,
Lemma 2.5.44
243
where
a1 (X) = b1 (X) = 0,
a0 (X) = 0, b0 (X) = 1,
ak (X) = qk (X)ak1 (X) + ak2 (X), k 1,
bk (X) = qk (X)bk1 (X) + bk2 (X), k 1.
Furthermore:
(1) deg ak (X) = deg qi (X), deg bk (X) = deg qk (X).
2ik
1ik
deg qk (X).
1ik+1
Proof
244
g j hi j =
j=0
'
1,
i = 0, N,
0,
1 i < N.
(2.6.1)
(b) Fix R (0, 1) and suppose we want to send one of a collection UN of messages
of length N , where the size UN = 2NR . The message is transmitted through an
245
MBSC with error-probability p < 1/2, so that we expect about pN errors. According to the asymptotic bound of part (a), for which values of p can we correct pN
errors, for large N ?
Solution (a) A code X FN2 is said to be E-error correcting if B(x, E) B(y, E) =
0/ for all x, y X with x = y. The Hamming bound for a code of size M, distance
d 1
d, correcting E =
errors is as follows. The balls of radius E about the
2
codewords are disjoint: their total volume equals M vN (E). But their union lies
inside FN2 , so M 2N /vN (E).
On the other hand, take an E-correcting code X of maximum size X . Then
there will be no word
y FN2 \ xX B(x, 2E + 1)
or we could add such a word to X , increasing the size but preserving the errorcorrecting property. Since every word y FN2 is less than d 1 from a codeword,
we can add y to the code. Hence, balls of radius d 1 cover the whole of FN2 , i.e.
M vN (d 1) 2N , or
M 2N /vN (d 1) (the VarshamovGilbert bound).
Combining these bounds yields, for (N, E) = log X N:
1
log vN (E)
log vN (2E + 1)
(N, E) 1
.
N
N
N
N
N
.
<
=
s1
N s+1 s
1 s
Consequently,
E
j
N
N
vN (E)
1 .
E
E j=0
1
1
log M lim sup log M 1 ( /2).
N
N
N
246
B
d 1
(b) We can correct pN errors if the minimum distance d satisfies
pN,
2
i.e. /2 p. Using the asymptotic Hamming bound we obtain R 1 ( /2)
1 (p). So, the reliable transmission is possible if p 1 (1 R),
The Shannon SCT states:
capacity C of a memoryless channel = sup I(X : Y ).
pX
Here I(X : Y ) = h(Y ) h(Y |X) is the mutual entropy between the single-letter
random input and output of the channel, maximised over all distributions of the
input letter X. For an MBSC with the error-probability p, the conditional entropy
h(Y |X) equals (p). Then
C = sup h(Y ) (p).
pX
But h(Y ) attains its maximum 1, by using the equidistributed input X (then Y is also
equidistributed). Hence, for the MBSC, C = 1 (p). So, a reliable transmission is
possible via MBSC with R 1 (p), i.e. p 1 (1 R). These two arguments
lead to the same answer.
Problem 2.3 Prove that the binary code of length 23 generated by the polynomial
g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 has minimum distance 7, and is perfect.
Hint: Observe that by the BCH bound (see Theorem 2.5.39) if a generator polynomial of a cyclic code has roots { , 2 , . . . , 1 } then the code has distance ,
and check that X 23 + 1 (X + 1)g(X)grev (X) mod 2, where grev (X) = X 11 g(1/X)
is the reversal of g(X).
Solution First, show that the code is BCH, of designed distance 5. Recall that if
is a root of a polynomial p(X) F2 [X] then so is 2 . Thus, if is a root of
g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 then so are 2 , 4 , 8 , 16 , 9 , 18 ,
13 , 3 , 6 , 12 . This yields the design sequence { , 2 , 3 , 4 }. By the BCH
theorem, the code X = "g(X)# has distance 5.
Next, the parity-check extension, X + , is self-orthogonal. To check this, we need
only to show that any two rows of the generating matrix of X + are orthogonal.
These are represented by
(X i g(X)|1) and (X j g(X)|1)
247
(2.6.2)
This implies that all words in X + have weight divisible by 4. Indeed, by inspection, all rows (X i g(X)|1) of the generating matrix of X + have weight 8.
Then, by induction on the number of rows involved in the sum, if c X + and
g(i) (X i g(X)|1) is a row of the generating matrix of X + then
w g(i) + c = w g(i) + w(c) 2w g(i) c ,
where g(i) c l = min g(i) l , cl , l = 1, . . . , 24. We know that 8|w g(i) and by
(i)
(i)
c
is
even,
so
2w
g c is divisthe induction hypothesis, 4|w(c).
Next,
w
g
(i)
ible by 4. Then the LHS, w g + c , is divisible by 4. Therefore, the distance of
X + is 8, as it is 5 and is divisible by 4. (Clearly, it cannot be bigger than 8 as
then it would be 12.) Then the distance of the original code, X , equals 7.
Finally, the code X is perfect 3-error correcting, since the volume of the 3-ball
in F23
2 equals
23
23
23
23
= 1 + 23 + 253 + 1771 = 2048 = 211 ,
+
+
+
3
2
1
0
and 212 211 = 223 . Here, obviously, 12 represents the rank and 23 the length.
Problem 2.4
Show that the Hamming code is cyclic with check polynomial
4
2
X + X + X + 1. What is its generator polynomial? Does Hammings original code
contain a subcode equivalent to its dual? Let the decomposition into irreducible
monic polynomials M j (X) be
l
X N + 1 = M j (X)k j .
j=1
(2.6.3)
248
The cyclic code with generator g(X) = X 3 + X + 1 has check polynomial h(X) =
X 4 + X 2 + X + 1. The parity-check matrix of the code is
1 0 1 1 1 0 0
0 1 0 1 1 1 0 .
(2.6.4)
0 0 1 0 1 1 1
The columns of this matrix are the non-zero elements of F32 . So, this is equivalent
to Hammings original [7, 4] code.
The dual of Hammings [7, 4] code has the generator polynomial X 4 + X 3 + X 2 + 1
(the reverse of h(X)). Since X 4 + X 3 + X 2 + 1 = (X + 1)g(X), it is a subcode of
Hammings [7, 4] code.
Finally, any irreducible polynomial M j (X) could be included in a generator of a
cyclic code in any power 0, . . . , k j . So, the number of possibilities to construct this
generator equals lj=1 (k j + 1).
Problem 2.5
Describe the construction of a ReedMuller code. Establish its
information rate and its distance.
m
m
Solution The space Fm
2 has N = 2 points. If A F2 , let 1A be the indicator
function of A. Consider the collection of hyperplanes
j = {p Fm
2 : p j = 0}.
Set h j = 1 j , j = 1, . . . , m, and h0 = 1Fm2 1. Define sets of functions Fm
2 F2 :
A0 = {h0 },
A1 = {h j ; j = 1, 2, . . . , m},
A2 = {hi h j ; i, j = 1, 2, . . . , m, i < j},
..
.
Ak+1 = {a h j ; a Ak , j = 1, 2, . . . , m, h j |a},
..
.
Am = {h1 hm }.
The union of these sets has cardinality N = 2m (there are 2m functions altogether).
N
Therefore, functions from m
i=0 Ai can be taken as a basis in F2 .
RM of length N = 2m is defined as
Then the ReedMuller code RM(r, m) = Xr,m
r
m
the span of ri=0 Ai and has rank
. Its information rate is
i=0 i
1 r m
i .
2m i=0
249
d RM(m, k) min 2d RM(m 1, k) , d RM(m 1, k 1) ,
which, by induction, yields
d RM(r, m) 2mr .
On the other hand, the vector h1 h2 hm is at distance 2mr from RM(m, r).
Hence,
d RM(r, m) = 2mr .
Problem 2.6 (a) Define a parity-check code of length N over the field F2 . Show
that a code is linear iff it is a parity-check code. Define the original Hamming code
in terms of parity-checks and then find a generating matrix for it.
(b) Let X be a cyclic code. Define the dual code
N
X = {y = y1 . . . yN : xi yi = 0 for all x = x1 . . . xN X }.
i=1
Prove that X is cyclic and establish how the generators of X and X are related to each other. Show that the repetition and parity-check codes are cyclic, and
determine their generators.
Solution (a) The parity-check code X PC of a (not necessarily linear) code X is
the collection of vectors y = y1 . . . yN FN2 such that the dot-product
N
y x = xi yi = 0 (in F2 ),
for all x = x1 . . . xN X .
i=1
From the definition it is clear that X PC is also the parity-check code for X , the
PC
linear code spanned by X : X PC = X . Indeed, if y x = 0 and y x = 0 then
y (x + x ) = 0. Hence, the parity-check code X PC is always linear, and it forms
a subspace dot-orthogonal to X . Thus, a given code X is linear iff it is a paritycheck code. A pair of linear codes X and X PC form a dual pair: X PC is the dual
of X and vice versa. The generating matrix H for X PC serves as a parity-check
matrix for X and vice versa.
250
1 1
0 1
0 0
0 0
0
1
1
0
1
0
1
1
0
1
0
1
0
0
1
0
0
0
.
0
1
(b) The generator of dual code g (X) = X N1 g(X 1 ). The repetition code has
g(X) = 1 + X + + X N1 and the rank 1. The parity-check code has g(X) = 1 + X
and the rank N 1.
Problem 2.7 (a) How does coding theory apply when the error rate p > 1/2?
(b) Give an example of a code which is not a linear code.
(c) Give an example of a linear code which is not a cyclic code.
(d) Define the binary Hamming code and its dual. Prove that the Hamming code is
perfect. Explain why the Hamming code cannot always correct two errors.
(e) Prove that in the dual code:
(i) The weight of any non-zero codeword equals 21 .
(ii) The distance between any pair of words equals 21 .
Solution (a) If p > 1/2, we reverse the output to get p = 1 p.
(b) The code X F22 with X = {11} is not linear as 00 X .
(c) The code X F22 with X = {00, 10} is linear, but not cyclic, as 01 X .
(d) The original Hamming [7, 4] code has distance 3 and is perfect one-error
correcting. Thus, making two errors in a codeword will always lead outside the
ball of radius 1 about the codeword, i.e. to a ball of radius 1 about a different
codeword (at distance 1 of the nearest, at distance 2 from the initial word). Thus,
one detects two errors but never corrects them.
(e) The dual of a Hamming [2 1, 2 1, 3] code is linear, of length N = 2 1
and rank , and its generating matrix is (2 1), with columns listing all nonzero vectors of length (the parity-check matrix of the original code). The rows of
this matrix are linearly independent; moreover, any row i = 1, . . . , has 21 digits
1. This is because each such digit comes from a column, i.e. a non-zero vector of
length , with 1 in position i; there are exactly 21 such vectors. Also any pair of
251
columns of this matrix are linearly independent, but there are triples of columns
that are linearly dependent (a pair of columns complemented by their sum).
Every non-zero dual codeword x is a sum of rows of the above generating matrix.
Suppose these summands are rows i1 , . . . , is where 1 i1 < < is . Then, as
above, the number of digits 1 in the sum equals the number of columns of this
matrix for which the sum of digits i1 , . . . , is is 1. We have no restriction on the
remaining s digits, so for them there are 2s possibilities. For digits i1 , . . . , is
we have 2s1 possibilities (a half of the total of 2s ). Thus, again 2s 2s1 = 21 .
We proved that the weight of every non-zero dual codeword equals 21 . That is,
the distance from the zero vector to any dual codeword is 21 . Because the dual
code is linear, the distance between any pair of distinct dual codewords x, x equals
21 :
of ways to place
0s and 1s outside J
with xi = 0 or 1
lJ
which yields 2|J| 2|J|1 = 21 . In other words, to get a contribution from a digit
(i)
x j = g j = 1, we must fix (i) a configuration of 0s and 1s over {1, . . . , } \ J (as it
iJ
is a part of the description of a non-zero vector of length N), and (ii) a configuration
of 0s and 1s over J,
an odd number of 1s.
with
To check that d X H
252
Solution (a) The necessary and sufficient condition for g(X) being the generator of
a cyclic code of length N is g(X)|(X N 1). The generator g(X) may be irreducible
or not; in the latter case it is represented as a product g(X) = M1 (X) Mk (X)
of its irreducible factors, with k d = deg g. Let s be the minimal number such
that N|2s 1. Then g(X) is factorised into the product of first-degree monomials
d
refers to the minimal field the splitting field for g, but this is not necessary.] Each
element i is a root of g(X) and also a root of at least one of its irreducible factors
M1 (X), . . . , Mk (X). [More precisely, each Mi (X) is a sub-product of the above firstdegree monomials.]
We want to select a defining set D of roots among 1 , . . . , d K: it is a collection comprising at least one root ji for each factor Mi (X). One is naturally tempted
to take a minimal defining set where each irreducible factor is represented by one
root, but this set may not be easy to describe exactly. Obviously, the cardinality |D|
of defining set D is between k and d. The roots forming D are all from field K but
in fact there may be some from its subfield, K K containing all ji . [Of course,
F2 K .] We then can identify the cyclic code X generated by g(X) with the set
of polynomials
+
*
f (X) F2 [X]/"X N 1# : f ( ) = 0 for all D .
It is said that X is a cyclic code with defining set of roots (or zeros) D.
(b) A binary BCH code of length N (for N odd) and designed distance is a cyclic
code with defining set { , 2 , . . . , 1 } where N and is a primitive Nth
root of unity, with N = 1. It is helpful to note that if is a root of a polynos1
mial p(X) then so are 2 , 4 , . . . , 2 . By considering a defining set of the form
{ , 2 , . . . , 1 } we fill the gaps in the above diadic sequence and produce an
ideal of polynomials whose properties can be analytically studied.
The simplest example is where N = 7 and D = { , 2 } where is a root of
3
X + X + 1. Here, 7 = ( 3 )2 = ( + 1)2 = 3 + = 1, so is a 7th root of
unity. [We used the fact that the characteristic is 2.] In fact, it is a primitive root.
Also, as was said, 2 is a root of X 3 + X + 1: ( 2 )3 + 2 + 1 = ( 3 + + 1)2 =
0, and so is 4 . Then the cyclic code with defining set { , 2 } has generator
X 3 +X +1 since all roots of this polynomial are engaged. We know that it coincides
with the Hamming [7, 4] code.
The Vandermonde determinant is
1
1
1
...
1
x1
x2
x3 . . . xn
.
= det
...
...
... ... ...
x1n1 x2n1 x3n1 . . . xnn1
253
Observe that if xi = x j (i = j) the determinant vanishes (two rows are the same).
Thus xi x j is a factor of ,
= P(x) (xi x j ),
i< j
(2.6.5)
i< j
where is a primitive Nth root of unity is called a BCH code of design distance
< N. Next, X BCH is a vector space over F2 and c X BCH iff
cH T = 0
where
1
1
H = 1
.
..
2
3
..
.
2
4
6
..
.
...
...
...
..
.
(2.6.6)
N1
2N2
3N3
..
.
(2.6.7)
1 1 2 2 . . . (N1)( 1)
Now rank H = . Indeed, by (2.6.5) for any minor H
det H = ( i j ) = 0.
i< j
254
(x, y) < d . Conclude that if d or more changes are made in a codeword then the
new word is closer to some other codeword than to the original one.
Suppose that a maximal [N, M, d] code is used for transmitting information via a
binary memoryless channel with the error-probability p, and the receiver uses the
maximum likelihood decoder. Prove that the probability of erroneous decoding,
ML , obeys the bounds
err
ML
1 b(N, d 1) err
1 b(N, (d 1)/2),
Problem 2.10
The Plotkin bound for an [N, M, d] binary code states that M
d
if d > N/2. Let M2 (N, d) be the maximum size of a code of length N and
d N/2
distance d , and let
1
( ) = lim log2 M2 (N, N).
N N
d
= 2d 2N(2d1) .
d (2d 1)/2
255
Solution If d > N/2 apply the Plotkin bound and conclude that ( ) = 0. If
d N/2 consider the partition of a code X of length N and distance d N/2
according to the last N (2d 1) digits, i.e. divide X into disjoint subsets, with
fixed N (2d 1) last digits. One of these subsets, X , must have size M such
that M 2N(2d1) M.
Hence, X is a code of length N = 2d 1 and distance d = d, with d > N /2.
Applying Plotkins bound to X gives
M
d
d
=
= 2d.
d N/2 d (2d 1)/2
Therefore,
M 2N(2d1) 2d.
Taking d = N with N yields ( ) 1 2 , 0 1/2.
Problem 2.11 State and prove the Hamming, Singleton and GilbertVarshamov
bounds. Give (a) examples of codes for which the Hamming bound is attained, (b)
examples of codes for which the Singleton bound is attained.
Solution The Hamming bound states that the size M of an E-error correcting code
X of length N,
2N
M
,
vN (E)
N
is the volume of an E-ball in the Hamming space
where vN (E) =
i
0iE
{0, 1}N . It follows from the fact that the E-balls about the codewords x X must
be disjoint:
M vN (E) = of points covered by M E-balls
2N = of points in {0, 1}N .
The Singleton bound is that the size M of a code X of length N and distance d
obeys
M 2Nd+1 .
It follows by observing that truncating X (i.e. omitting a digit from the codewords
x X ) d 1 times still does not merge codewords (i.e. preserves M) while the
resulting code fits in {0, 1}Nd+1 .
The GilbertVarshamov bound is that the maximal size M = M2 (N, d) of a
binary [N, d] code satisfies
2N
M
.
vN (d 1)
256
This bound follows from the observation that any word y {0, 1}N must be within
distance d 1 from a maximum-size code X . So,
M vN (d 1) of points within distance d 1 = 2N .
Codes attaining the Hamming bound are called perfect codes, e.g. the Hamming
[2 1, 2 1 , 3] codes. Here, E = 1, vN (1) = 1 + 2 1 = 2 and M = 22 1 .
Apart from these codes, there is only one example of a (binary) perfect code: the
Golay [23, 12, 7] code.
Codes attaining the Singleton bound are called maximum distance separable
(MDS): their check matrices have any N M rows linearly independent. Examples
of such codes are (i) the whole {0, 1}N , (ii) the repetition code {0 . . . 0, 1 . . . 1}
and the collection of all words x {0, 1}N of even weight. In fact, these are all
examples of binary MDS codes. More interesting examples are provided by Reed
Solomon codes that are non-binary; see Section 3.2. Binary codes attaining the
GilbertVarshamov bound for general N and d have not been constructed so far
(though they have been constructed for non-binary alphabets).
Problem 2.12 (a) Explain the existence and importance of error correcting codes
to a computer engineer using Hammings original code as your example.
(b) How many codewords in a Hamming code are of weight 1? 2? 3? 4? 5?
Solution (a) Consider the linear map F72 F32 given by the matrix H of the form
(2.6.4). The Hamming code X is the kernel ker H, i.e. the collection of words
x = x1 x2 x3 x4 x5 x6 x7 {0, 1}7 such that xH T = 0. Here, we can choose four digits,
say x4 , x5 , x6 , x7 , arbitrarily from {0, 1}; then x1 , x2 , x3 will be determined:
x1 = x4 + x5 + x7 ,
x2 = x4 + x6 + x7 ,
x3 = x5 + x6 + x7 .
It means that code X can be used for encoding 16 binary messages of length 4.
If y = y1 y2 y3 y4 y5 y6 y7 differs from a codeword x X in one place, say y = x + ek
then the equation yH T = ek H T gives the binary decomposition of number k, which
leads to decoding x. Consequently, code X allows a single error to be corrected.
Suppose that the probability of error in any digit is p << 1, independently of
what occurred to other digits. Then the probability of an error in transmitting a
non-encoded (4N)-digit message is
1 (1 p)4N 4N p.
257
But using the Hamming code we need to transmit 7N digits. An erroneous transmission requires at least two wrong digits, which occurs with probability
N
7 2
1 1
p
21N p2 << 4N p.
2
So, the extra effort of using 3 check digits in the Hamming code is justified.
(b) A Hamming code X H, of length N = 2 1 ( 3) consists of binary words
x = x1 . . . xN such that xH T = 0 where H is an N matrix whose columns
h(1) , . . . , h(N) are all non-zero binary vectors of length l. Hence, the number of
N
Problem 2.13 (a) The dot-product of vectors x, y from a binary Hamming space
HN is defined as x y = Ni=1 xi yi (mod 2), and x and y are said to be orthogonal
if x y = 0. What does it mean to say that X HN is a linear [N, k] code with
generating matrix G and parity-check matrix H ? Show that
X = {x HN : x y = 0 for all y X }
is a linear [N, N k] code and find its generator and parity-check matrices.
(b) A linear code X is called self-orthogonal if X X . Prove that X is selforthogonal if the rows of G are self and pairwise orthogonal. A linear code is called
self-dual if X = X . Prove that a self-dual code has to be an [N, N/2] code (and
hence N must be even). Conversely, prove that a self-orthogonal [N, N/2] code, for
N even, is self-dual. Give an example of such a code for any even N and prove that
a self-dual code always contains the word 1 . . . 1.
258
(c) Consider now a Hamming [2 1, 2 1] code XH, . Describe the generating
. Prove that the distance between any two codewords in X equals
matrix of XH,
H,
1
2 .
Solution By definition, X is preserved under the linear operations; hence X
is a linear code. From algebraic considerations, dim X = N k. The generating
matrix G of X coincides with H, and the parity-check matrix H with G.
If X X then the rows g(1) , . . . , g(k) of G are self- and pairwise orthogonal.
The converse is also true. From the previous observation, if X is self-dual then
k = N k, i.e. k = N/2, and N should be even. Similarly, if X is self-orthogonal
and k = N/2 then X is self-dual.
Let 1 = 1 . . . 1. If X = X then 1 g(i) = g(i) g(i) = 0. So, 1 X and hence
1 X . An example is a code with the generating matrix
1 1 1 ... 1
1 1 0 . . . 0
G = 1 0 1 . . . 0
. . . .
.. .. .. . . ...
1 0 0 ... 1
N/2
1
1
1
..
.
...
...
...
..
.
1
0
..
.
1 ... 1
N/2
N/2.
Clearly, w(g(i) ) = 21 as exactly half of all 2 vectors have 1 on any given position.
The proof is finished by induction on J.
A simple and elegant way is to use the MacWilliams identity (cf. Lemma 3.4.4)
which immediately gives
WX (s) = 1 + (2 1)s2
1
(2.6.8)
259
0011001
0100101
0010110
1110000
E
F
G
H
0111100
0001111
1101001
0110001
I
J
K
L
1010101
1100110
0101010
1001100
M
N
O
1111111
1000011
0000000
1011010
1 0 ... 0 1 ... 1
0 1 ... 0 0 ... 1
H =
. . . . . . . . . . . . . . . . . . . . . .
0 0 ... 1 1 ... 1
Here the columns are meant to be lexicographically ordered. Different matrices
obtained from the above by permuting the rows define different, but equivalent,
codes: they are all named Hamming codes.
To perform decoding, we have to fix a matrix H (the check matrix) and let it
be known to both the sender and the receiver. Upon receiving a word (string) y =
y1 . . . yN we form a syndrome vector yH T . If yH T = 0, we decode y by itself. (We
260
have no means to determine if the original codeword was corrupted by the channel
or not.)
If yH T = 0 then yH T coincides with a column of H. Suppose yH T gives column
j of H; then we decode y by
x = y + e j where e j = 0 . . . 1 . . . 0 (1 in digit j).
In other words, we change digit j in y and decide that it was the word sent through
the channel. This works well when errors in the channel are rare.
If = 3 a Hamming [7, 4] code contains 24 = 16 codewords. These codewords
are fixed when H is fixed: in the example they are used for encoding 15 letters from
A to O and the space character . Upon receiving a message we divide it into words
of length 7: in the example there are 15 words altogether. Performing the decoding
procedure leads to
JOHNNIEBEGOOD
0 1 0 . . . 0T ,
and
1 1 0 . . . 0T .
Hence, the minimal distance equals 3. Therefore, if a single error occurs, i.e. the
received word is at distance 1 from a codeword, then this codeword is uniquely
determined. Hence, the Hamming code is single-error correcting.
261
and
(1 + N)2N = 2 2N = 2N .
The information rate of the code equals
2 1
.
rank length =
2 1
The code with = 3 has the 3 7 parity-check matrix of the form (2.6.4); any
permutation of rows leads to an equivalent code. The generating matrix is 4 7:
1 0 0 0 1 1 1
0 1 0 0 0 1 1
0 0 1 0 1 0 1
0 0 0 1 0 1 1
and the information rate 4/7. The Hamming code with = 2 is trivial: it contains
a single non-zero codeword 1 1 1.
Problem 2.16
Define a BCH code of length N over the field Fq with designed
distance . Show that the minimum weight of such a code is at least .
Consider a BCH code of length 31 over the field F2 with designed distance 8.
Show that the minimum distance is at least 11.
Solution A BCH code of length N over the field Fq is defined as a cyclic code X
whose minimum degree generator polynomial g(X) Fq [X], with g(X)|(X N 1)
(and hence deg g(X) N), contains among its roots the subsequent powers ,
2 , . . . , 1 where Fqs is a primitive Nth root of unity. (This root lies
in an extension field Fqs the splitting field for X N 1 over Fq , i.e. N|qs 1.) Then
is called the designed distance for X ; the actual distance (which may be difficult
to calculate in a general situation) is .
If we consider the binary BCH code X of length 31, should be a primitive
root of unity of degree 31, with 31 = 1 (the root lies in an extension field F32 ).
262
8 = ( 4 )2 , 9 = ( 5 )8 , and 10 = ( 5 )2 .
That is, the defining set can be extended to
, 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10
(all these elements are distinct, as is a primitive 31st root of unity). In fact, code
X has designed distance 11. Hence, the minimum distance in X is 11.
Problem 2.17
Let X be a linear [N, k, d] code over the binary field F2 , and
G be a generating matrix of X , with k rows and N columns, such that exactly
d of the first rows entries are 1. Let G1 be the matrix, of k 1 rows and N d
columns, formed by deleting the first row of G and those columns of G with a
non-zero entry in the first row. Show that X1 , the linear code generated by G1 ,
has minimum distance d d/2. Here, for a real number x, x is the integer
satisfying x x < x + 1.
Show also that X1 has rank k 1. Deduce that
3 i4
N
d 2 .
0ik1
Solution Let x be the codeword in X represented by the first row of G and pick
a pair of other rows, say y and z. After the first deleting they become y and z ,
correspondingly. Both weights w(y ) and w(z ) must be d/2: otherwise at least
one of the original words y and z, say y, would have had minimum d/2 digits 1
among deleted d digits (as w(y) d by condition). But then
w(x + y) = w(y ) + d d/2 < d
which contradicts the condition that the distance of X is d.
We want to check that the weight w(y + z ) d/2. Assume the opposite:
w(y + z ) = m < d/2 .
Then m = w(y0 + z0 ) must be d m d/2 where y0 is the deleted part of y,
of length d, and z0 is the deleted part of z, also of length d. In fact, as before, if
m < d m then w(y + z) < d which is impossible. But if m d m then
w(x + y + z) = d m + m < d,
again impossible. Hence, the sum of any two rows of G1 has weight d/2.
263
This argument can be repeated for the sum of any number of rows of G1 (not
exceeding k 1). In fact, in the case of such a sum x + y + + z, we can pass to
new matrices, G and G1 , with this sum among the rows. We conclude that X1 has
minimum distance d d/2. The rank of X1 is k 1, for any k 1 rows of G1
are linearly independent. (The above sum cannot be 0.)
Now, the process of deletion can be applied to X1 (you delete d columns in
G1 yielding digits 1 in a row of G1 with exactly d digits 1). And so on, until you
exhaust the initial rank k by diminishing it by 1. This leads to the required bound
4
3
N d + d/2 + d/22 + + d 2k1 .
Problem 2.18 Define a cyclic linear code X and show that it has a codeword of
minimal length which is unique, under normalisation to be stated. The polynomial
g(X) whose coefficients are the symbols of this codeword is the (minimum degree)
generator polynomial of this code: prove that all words of the code are related to
g(X) in a particular way.
Show further that g(X) can be the generator polynomial of a cyclic code with
words of length N iff it satisfies a certain condition, to be stated.
There are at least three ways of determining the parity-check matrix of the code
from a knowledge of the generator polynomial. Explain one of them.
Solution Let X be the cyclic code of length N with generator polynomial g(X) =
gi X i of degree d. Without loss of generality, assume the code is non-trivial,
0id
of degree k = N d;
(b) a string a = a0 . . . aN1 X iff the polynomial a(X) =
ai X i has the
0iN1
264
One way to specify the parity-check matrix is to take the ratio (X N 1) g(X) =
h(X) = h0 + h1 X + + hk X k . Then form the N (N k) matrix
0 ... 0 0
hk hk1 . . .
0
hk hk1 . . . h1 . . . 0
.
(2.6.9)
H =
. . . . . .
. . . . . . . . . . . . . . .
0
0
. . . hk . . . h1 h0
j
The rows of H are the cyclic shifts h , 0 j d 1 = N k 1, of the string
h = hk . . . h0 0 . . . 0.
We claim that for all a X , aH T = 0. In fact, it suffices to check that for the
basis words j g, j gH T = 0, j = 0, . . . , k 1. That is, the dot-product
j1 g j2 h = 0, 0 j1 < k, 0 j2 < N k 1.
(2.6.10)
k1 g h = g0 hk + g1 hk1 = 0
since it gives the first coefficient (at monomial X) of the product g(X)h(X) =
X N 1. Similarly, for j1 = k 2 and j2 = 0, k2 g h gives the second coefficient
of g(X)h(X) (at monomial X 2 ) and is again equal to 0. And so on: for j1 = j2 = 0,
g h = 0 as the kth-degree coefficient in g(X)h(X).
Continuing, g h equals the (k + 1)st-degree coefficient in g(X)h(X), g 2 h
the (k + 2)nd, and so on; g Nk1 h = gd1 hk + gd hk1 the (N 1)st. As before,
they all vanish.
The same holds true when we simultaneously shift both words cyclically (when
possible) which leads to (2.6.10).
Conversely, suppose that aH T = 0 for some word a = a0 . . . aN1 . Write the corresponding polynomial a(X) as a(X) = f (X)g(X) + r(X) where the ratio f (X) =
fi X i and r(X) is the remainder. Then either r(X) = 0 or 1 deg r(X) =
0ik1
0ik
hi X ki .
265
Alternatively, let h(X) be the check polynomial for the cyclic code X length N
with a generator polynomial g(X) so that g(X)h(X) = X N 1. Then:
(a) X = { f (X): f (X)h(X) = 0 mod (X N e)};
(b) if h(X) = h0 + h1 X + + hNr X Nr then the parity-check matrix H of X
has the form (2.6.9);
(c) the dual code X is a cyclic code of dim X = r, and X = "h (X)#,
Nr h(X 1 ) = h1 (h X Nr + h X Nr1 + + h
where h (X) = h1
0
1
Nr ).
0 X
0
Problem 2.19
Consider the parity-check matrix H of a Hamming [2 1, 2
1] binary code. Form the parity-check matrix H of a [2 , 2 1] code by
augmenting H with a column of zeros and then with a row of ones. The dual of
the resulting code is called a first-order ReedMuller code. Show that a first-order
ReedMuller code can correct errors of up to 22 1 bits per codeword.
For the photographs of Mars taken by the Mariner spacecraft such code with
= 5 was used in 1972. What was the code rate? Why is this likely to have been
much less than the capacity of the channel?
Solution The code in question is [2 , + 1, 21 ]; with = 5, the information rate
equals 6/32 1/5. Let us check that all codewords except 0 and 1 have weight
21 . For 1 the code R() is defined by recursion
R( + 1) = {xx|x R()} {x, x + 1|x R()}.
So, the length of codewords in R( + 1) is obviously 2+1 . As {xx|x R()}
and {x, x + 1|x R()} are disjoint, the number of codewords is doubled, i.e.
R( + 1) = 2+2 . Finally, assuming that all codewords in R() except 0 and 1
have weight 21 , consider a codeword y R(l + 1). If y = xx is different from 0
or 1, then x = 0 or 1, and so w(y) = 2w(x) = 2 21 = 2 .
If y = x, x + 1 we must consider some cases. If x = 0 then y = 01, which has
weight 2l . If x = 1 then y = 10, which also has weight 2l . Finally, if x = 0 or 1
then w(x + 1) = 2 21 = 21 and w(y) = 2 21 = 2 . It is clear now that
codewords xx and x, x + 1 with w(x) = 21 are orthogonal to rows of parity-check
matrix H .
Up to 7 bits may be in error, thus the probability of a transmission error pe (for
a binary symmetric memoryless channel with the error-probability p) obeys
32 i
pe 1
p (1 p)32i ,
i
0i7
which is small when p is small. (As an estimate of an acceptable p, we can take the
solution to 1 + p log p + (1 p) log(1 p) = 26/32.) If the block length is fixed
(and rather small), with a low value of p we cant get near the capacity.
266
Indeed, for = 5, the code is [32, 6, 16], detecting 15 and correcting 7 errors. That
is, the code can correct a fraction > 1/5 of the total of 32 digits. Its information
rate is 6/32 and if the capacity of the (memoryless) channel is C = 1 (p) (where
p stands for the symbol-probability of error), we need the bound C > 6/32; that
is, (p) + 6/32 < 1, for a reliable transmission. This yields |p 1/2| > |p 1/2|
where p (0, 1) solves 26/32 = (p ). Definitely 0 p < 1/5 and 4/5 < p 1
would do. In reality the error-probability was much less.
Problem 2.20 Prove that any binary [5, M, 3] code must have M 4. Verify that
there exists, up to equivalence, exactly one [5, 4, 3] code.
Solution By the Plotkin bound, if d is odd and d > 12 (N 1) then
M2 (N, d) 2
d +1
.
2d + 1 N
In fact,
M2 (5, 3) 2
4
= 2 2 = 4.
6+15
Show that if d2 is the distance of the code with generating matrix G2 then d2 d/2.
Solution Let X be [N, k, d]. We can always form a generating matrix G of X where
the first row is a codeword x with w(x) = d; by permuting columns of G we can
have the first row in the form 1 . . . 1d 0 . . . 0Nd . So, up to equivalence,
: ;< = : ;< =
1...1 0...0
G=
.
G1
G2
Suppose d(G2 ) < d/2 then, without loss of generality, we may assume that there
exists a row of (G1 G2 ) where the number of ones among digits d + 1, . . . , N is
< d/2. Then the number of ones among digits 1, . . . d in this row is > d/2, as its
total weight is d. Then adding this row and 1 . . . 1 0 . . . 0 gives a codeword with
weight < d. So, d(G2 ) d/2.
267
Problem 2.22 (GilbertVarshamov bound) Prove that there exists a p-ary linear
[N, k, d] code if pk < 2N /vN1 (d 2). Thus, if pk is the largest power of p satisfying
this inequality, we have Mp (N, d) pk .
Solution We construct a parity-check matrix by selecting N columns of length
N k with the requirement that no d 1 columns are linearly dependent. The first
column may be any non-zero string in ZNk
p . On the step i 2 we must choose
a column which is not a linear combination of any d 2 (or fewer) of previously
selected columns. The number of such linear combinations (with non-zero coefficients) is
d2
i1
Si =
(p 1) j .
j
j=1
So, the parity-check matrix may be constructed iff SN + 1 < pNk . Finally, observe
that SN + 1 = vN1 (d 2). Say, there exists [5, 2k , 3] code if 2k < 32/5, so k = 2
and M2 (5, 3) 4, which is, in fact, sharp.
Problem 2.23 An element b Fq is called primitive if its order (i.e. the minimal
k such that bk = 1 mod q) is q 1. It is not difficult to find a primitive element of
the multiplicative group Fq explicitly. Consider the prime factorisation
s
q 1 = pj j.
j=1
(q1)/p j
(q1)/p j
= e. Set b j = a j
and
0. Because p j are distinct primes, it follows that n = 0 mod p j j for any j. Hence,
n = sj=1 p j j .
p
Problem 2.24 The minimal polynomial with a primitive root is called a primitive
polynomial. Check that among irreducible binary polynomials of degree 4 (see
(2.5.9)), 1 + X + X 4 and 1 + X 3 + X 4 are primitive and 1 + X + X 2 + X 3 + X 4 is
not. Check that all six irreducible binary polynomials of degree 5 (see (2.5.15))
are primitive; in practice, one prefers to work with 1 + X 2 + X 5 as the calculations
modulo this polynomial are slightly shorter. Check that among the nine irreducible
polynomials of degree 6 in (2.5.16), there are six primitive: they are listed in the
upper three lines. Prove that a primitive polynomial exists for every given degree.
268
Solution For the solution to the last part, see Section 3.1.
Problem 2.25 A cyclic code X of length N with the generator polynomial g(X)
of degree d = N k can be described in terms of the roots of g(X), i.e. the elements
1 , . . . Nk such that g( j ) = 0. These elements are called zeros of code X and
belong to a Galois field F2d . As g(X)|(1+X N ), they are also among roots of 1+X N .
That is, Nj = 1, 1 j N k, i.e. the j are N th roots of unity. The remaining k
roots of unity 1 , . . . , k are called non-zeros of X . A polynomial a(X) X iff,
in Galois field F2d , a( j ) = 0, 1 j N k.
(a) Show that if X is the dual code then the zeros of X are 1 1 , . . . , k 1 , i.e.
the inverses of the non-zeros of X .
(b) A cyclic code X with generator g(X) is called reversible if, for all x =
x0 . . . xN1 X , the word xN1 . . . x0 X . Show that X is reversible iff g( ) = 0
implies that g( 1 ) = 0.
(c) Prove that a q-ary cyclic code X of length N with (q, N) = 1 is invariant under
the permutation of digits such that q (i) = qi mod N (i.e. x xq ). If s = ordN (q)
then the two permutations i i + 1 and q (i) generate a subgroup of order Ns in
the group Aut(X ) of the code automorphisms.
Solution Indeed, since a(xq ) = a(x)q is proportional to the same generator polynomial it belongs to the same cyclic code as a(x).
Problem 2.26 Prove that there are 129 non-equivalent cyclic binary codes of
length 128 (including the trivial codes, {0 . . . 0} and {0, 1}128 ). Find all cyclic binary codes of length 7.
Solution The equivalence classes of the cyclic codes of length 2k are in a one-tok
one correspondence with the divisors of 1+X 2 ; the number of those equals 2k +1.
Furthermore, there are eight codes listed by their generators which are divisors of
X 7 1 as
X 7 1 = (1 + X)(1 + X + X 3 )(1 + X 2 + X 3 ).
3
Further Topics from Coding Theory
270
(a b) p = a p + (b) p .
(3.1.1)
p
k ak (b) pk .
0kp
p
For 1 k p 1, the value
is a multiple of p and the corresponding term
k
vanishes. Therefore, (a b) p = a p + (b) p . The inductive step is completed by the
n1
n1
same argument, with a and b replaced by a p and (b) p .
Lemma 3.1.6 The multiplicative group F of non-zero elements of a field F of
size q is isomorphic to the cyclic group Zq1 .
Proof Observe that for any divisor d|(q 1), group F contains exactly (d)
elements of multiplicative order d where is Eulers totient phi-function. (Recall
that (d) = {k : k < d, gcd(k, d) = 1}.) Well see that all elements of order d have
q1
the form a d r where a is a primitive element, r d and r, d are co-prime. In fact,
q 1 = (d), and F will have at least one element of order q 1 which
d:d|(q1)
271
{e, a, . . . , ad1 }. Observe that the cyclic group Zd has exactly (d) elements of
order d. So, the whole F has exactly (d) elements of order d; in other words, if
(d) is the number of elements in F of order d then either (d) = 0 or (d) = (d)
and
q1 =
(d)
d:d(n)
(d) = q 1,
d:d|n
(d) = (d).
Definition 3.1.7 A (multiplicative) generator of F (i.e. an element of multiplicative order q 1) is called a primitive element of field F. Although such an element
is non-unique, we will usually single out one such element and denote it by ;
of course a power r where r is coprime with (q 1) will also give a primitive
element.
If a F with F = q 1 then aq1 = e (the order of every element divides the
order of the group). Hence, aq = a, i.e. a is a root of the polynomial X q X in F.
But X q X can have only q roots (including zero 0), so F gives the set of all
roots of X q X.
Definition 3.1.8 Given fields K and F, with F K, field K is called the splitting
field for a polynomial g(X) with coefficients from F if (a) K contains all roots of
g(X), (b) there is no field K with F K K satisfying (a). We will write Spl(g(X))
for the splitting field for g(X).
Thus, if F = q then F contains all roots of polynomial X q X and is the splitting
field for this polynomial.
Lemma 3.1.9 Any two splitting fields K, K for the same polynomial g(X) with
coefficients from F coincide.
Proof In fact, take the intersection K K : it contains F and is a subfield of both
K and K . It must then coincide with each of K, K .
Corollary 3.1.10 For any prime p and natural s 1, there exists at most one
field with ps elements.
Proof Each such field is splitting for polynomial X q X with coefficients from
Z p and q = ps . So any two such fields coincide.
On the other hand, we will prove later the following.
Theorem 3.1.11 For any non-constant polynomial with coefficients from F, there
exists a splitting field.
272
Corollary 3.1.12 For any prime p and natural s 1, there exists precisely one
field with ps elements.
Proof of Corollary 3.1.12 Take again the polynomial X q X with coefficients from
Z p and q = ps . By Theorem 3.1.11, there exists the splitting field Spl(X q X)
where X q X = X(X q1 e) is factorised into linear polynomials. So, Spl(X q X)
contains the roots of X q X and has characteristic p (as it contains Z p ).
However, the roots of (X q X) form a subfield: if aq = a and bq = b then (a
q
b) = aq + (bq ) (Lemma 3.1.5) which coincides with a b. Also, (ab1 )q =
aq (bq )1 = ab1 . This field cannot be strictly contained in Spl(X q X) thus it
coincides with Spl(X q X).
It remains to check that all roots of (X q X) are distinct: then the cardinality
Spl(X q X) will be equal to q. In fact, if X q X had a multiple root then it would
have had a common factor with its derivative X (X q X) = qX q1 e. However,
qX q1 = 0 in Spl(X q X) and thus cannot have such factors.
Summarising, we have the two characterisation theorems for finite fields.
Theorem 3.1.13 All finite fields have size ps where p is prime and s 1 integer.
For all such p, s, there exists a unique field of this size.
The field of size q = ps will be denoted by Fq (a popular alternative notation is
GF(q) (a Galois field)). In the case of the simplest fields F p = {0, 1, . . . , p 1} (for
p is prime) we use symbol 1 instead of e for the unit.
Theorem 3.1.14 All finite fields can be arranged into sequences (towers). For
a prime p and positive integers s1 , s2 , . . .,
...
...
...
F ps1 s2 ...si
...
F ps1 s2
F ps1
Fp Zp
273
X0
X
X2
X3
X4
X5
X6
X7
X8
X9
X 10
X 11
X 12
X 13
X 14
polynomial
vector
(string)
0
1
X
X2
X3
1 + X3
1 + X + X3
1 + X + X2 + X3
1 + X + X2
X + X2 + X3
1 + X2
X + X3
1 + X2 + X3
1+X
X + X2
X2 + X3
0000
1000
0100
0010
0001
1001
1101
1111
1110
0111
1010
0101
1011
1100
0110
0011
(3.1.2)
274
From now on we will focus on polynomial representations of finite fields. Generalising concepts introduced in Section 2.5, consider
Definition 3.1.17 The set of all polynomials with coefficients from Fq is a commutative ring denoted by Fq [X]. A quotient ring Fq [X]/"g(X)# is where the operation is modulo a fixed polynomial g(X) Fq [X].
Definition 3.1.18 A polynomial g(X) Fq [X] is called irreducible (over Fq ) if it
admits no representation
g(X) = g1 (X)g2 (X)
with g1 (X), g2 (X) Fq [X].
A generalisation of Theorem 2.5.32 is presented in Theorem 3.1.19 below.
Theorem 3.1.19
Let g(X) Fq [X] have degree deg g(X) = d . Then
Fq [X]/"g(X)# is a field Fqd iff g(X) is irreducible.
Proof Let g(X) be an irreducible polynomial over Fq . To show that Fq [X]/"g(X)#
is a field we should check that each non-zero element f (X) Fq [X]/"g(X)# has an
inverse. Consider the set F( f ) of polynomials of the form f (X)h(X) mod g(X)
where h(X) Fq [X]/"g(X)# (the principal ideal generated by f (X)). If F( f ) contains the unity e Fq (the constant polynomial equal to e) then the corresponding
h(X) = f (X)1 . If not, the map h(X) f (X)h(X) mod g(X), from Fq [X]/"g(X)#
to itself, is not a surjection. That is, f (X)h1 (X) = f (X)h2 (X) mod g(X) for some
distinct h1 (X), h2 (X), i.e.
f (X) h1 (X) h2 (X) = r(X)g(X).
275
Then either g(X)| f (X) or g(X)| h1 (X) h2 (X) as g(X) is irreducible. So, either f (X) = 0 mod g(X) (a contradiction) or h1 (X) = h2 (X) mod g(X). Hence,
Fq [X]/"g(X)# is a field.
The inverse assertion is proved similarly: if g(X) is reducible then Fq [X]/"g(X)#
contains non-zero g1 (X), g2 (X) with g1 (X)g2 (X) = 0. Then Fq [X]/"g(X)# cannot
be a field.
The dimension Fq [X]/"g(X)# : Fq is equal to d, the degree of g(X), so
Fq [X]/"g(X)# = Fqd .
Worked Example 3.1.20
Prove that
g(X) has an inverse in the polynomial ring
Fq [X]/"X N e# iff gcd g(X), X N e = e.
Solution Consider the map Fq [X]/"X N e# Fq [X]/"X N e# given by h(X)
h(X)g(X) mod (X N e). If it is a surjection then there exists h(X) with
h(X)g(X) = e and h(X) = g(X)1 . Suppose it is not. Then there exist h(1) (X) =
h(2) (X) mod (X N e) such that h(1) (X)g(X) = h(2) (X)g(X) mod (X N e), i.e.
(h(1) (X) h(2) (X))g(X) = s(X)(X N e).
As (X N e) | (h(1) (X) h(2) (X)), this means that gcd(g(X), X N e) = e.
Conversely, if gcd(g(X), X N e) = d(X) = e then the equation h(X)g(X) = e
mod (X N e) gives
h(X)g(X) = e + q(X)(X N e)
where d(X)|LHS and d(X)|q(X)(X N e)). Therefore, d(X)|e: a contradiction.
Hence, g(X)1 does not exist.
Example 3.1.21 (Continuing Example 2.5.19) There are six irreducible binary
polynomials of degree 5:
1 + X 2 + X 5, 1 + X 3 + X 5, 1 + X + X 2 + X 3 + X 5,
1 + X + X 2 + X 4 + X 5, 1 + X + X 3 + X 4 + X 5,
1 + X 2 + X 3 + X 4 + X 5.
(3.1.3)
Then there are nine irreducible polynomials of degree 6, and so on. Calculating
irreducible polynomials of a large degree is a demanding task, although extensive
tables of such polynomials are now available on the web.
We are now going to prove Theorem 3.1.11.
Proofof Theorem 3.1.11 The key fact is that any non-constant polynomial g(X)
Fq [X] has a root in some extension of Fq . Without loss of generality, assume that
g(X) is irreducible, with deg g(X) = d. Take Fq [X]/"g(X)# = Fqd as an extension
field. In this field, g( ) = 0 where is polynomial X Fq [X]/"g(X)#, so g(X)
276
has a root. We can divide g(X) by X in Fqd and use the same construction
to prove that g1 (X) = g(X)/(X ) has a root in some extension of Fqt ,t < d.
Finally, we obtain a field containing all d roots of g(X), i.e. construct the splitting
field Spl(g(X)).
Definition 3.1.22 Given a field F K and an element K, we denote by
F( ) the smallest field containing F and (obviously, F F( ) K). Similarly,
F(1 , . . . , r ) is the smallest field containing F and elements 1 , . . . , r K. For
F = Fq and K, set
d1
M ,F (X) = (X )(X q ) . . . X q
,
(3.1.4)
where d is the smallest positive integer such that q = (such a d exists as will
be proved in Lemma 3.1.24).
A monic polynomial is the one with the highest coefficient. The minimal polynomial for K over F is a unique monic polynomial M (X) (= M ,F (X))
F[X] such that M ( ) = 0 and M (X)|g(X) for each g(X) F[X] with g( ) = 0.
When is a primitive element of K (generating K ), M (X) is called a primitive
polynomial (over F). The order of a polynomial p(X) F[X] is the smallest n such
that p(X)|(X n e).
Example 3.1.23 (Continuing Example 3.1.21.) In this example we deal with
polynomials over F2 . The irreducible polynomial X 2 + X + 1 is primitive and has
order 3. The irreducible polynomials X 3 + X + 1 and X 3 + X 2 + 1 are primitive and
of order 7. The polynomials X 4 + X 3 + 1 and X 4 + X + 1 are primitive and have
order 15 whereas X 4 + X 3 + X 2 + X + 1 is not primitive and of order 5. (It is helpful
to note that with d = 4, the order of X 4 +X 3 +1 and X 4 +X +1 equals 2d 1; on the
other hand, the order of element X in the field F2 [X]/"1+X +X 2 +X 3 +X 4 # equals
5, but its order, say, in the field F2 [X]/"1 + X + X 4 # equals 15.) All six polynomials
listed in (3.1.3) are primitive and have order 31 (i.e. appear in the decomposition
of X 31 + 1).
Lemma 3.1.24 Let Fq Fqd and Fqd . Let M (X) F[X] be the minimal
polynomial for , of degree deg M (X) = d . Then:
(a) M (X) is the only irreducible polynomial in Fq [X] with a root at .
(b) M (X) is the only monic polynomial in Fq [X] of degree d with a root at .
(c) M (X) has the form (3.1.4).
277
Proof Assertions (a), (b) follow from the definition. To prove (c), assume K is
a root of a polynomial f (X) = a0 + a1 X + + ad X d from F[X], i.e. ai i = 0.
0id
0id
0id
q
2
so q is a root. Similarly, q = q is a root, and so on.
2
s
For M (X) it yields that , q , q , . . . are roots. This will end when q =
for the first time (which proves the existence of such an s). Finally, s = d as all
d1
i
j
, q , . . . , q are distinct: if not then q = q where, say, i < j. Taking qd j
d+i
j
d
power of both sides, we get q
= q = . So, is a root of polynomial
d+i
j
X, and Spl(P(X)) = Fqd+i j . On the other hand, is a root of an
P(X) = X q
irreducible polynomial of degree d, and Spl(M (X)) = Fqd . Hence, d|(d + i j)
i
or d|(i j), which is impossible. This means that all the roots q , i < d, are
distinct.
Theorem 3.1.25 For any field Fq and integer d 1, there exists an irreducible
polynomial f (X) Fq [X] of degree d .
Proof Take a primitive element Fqd . Then Fq ( ), the minimal extension of
Fq containing , coincides with Fqd . The dimension [Fq ( ) : Fq ] of vector space
Fq ( ) over Fq equals [Fqd : Fq ] = d. The minimal polynomial M (X) for over
d1
Fq has distinct roots , q , . . . , q and therefore is of degree d.
Although proving irreducibility of a given polynomial is a problem with no general solution, the number of irreducible polynomials of a given degree can be evaluated by using an elegant (and not very complicated) method invoking the so-called
Mobius function.
Definition 3.1.26
1
(d)qn/d .
n d:
d|n
(3.1.5)
278
(3.1.6)
d|n
and
(n) = (d)
d|n
n
d
(3.1.7)
This equivalence follows when we observe that (a) the sum (d) is equal to 0 if
d|n
d: d|n
c: c|n/d
d: d|n/c
279
X
= Fqn . By
d and Spl X
q
n
q
Theorem 3.1.29, Spl(g(X)) Spl X X iff d|n.
n
n
Now if g(X)| X q X , each root of g(X) is a root of X q X . Then
n
Spl(g(X)) Spl X q X and hence d|n.
280
n
Conversely,
Spl X q X , then each root of
lies
qn
if d|n, i.e.
Spl(g(X))
g(X)
in
n
qn X , so
Spl X X . But Spl X q X is precisely
the
set
of
the
roots
of
X
n
n
each root of g(X) is that of X q X . Then g(X)| X q X .
Theorem 3.1.31 If g(X) Fq [X] is an irreducible polynomial of degree d and
d1
Spl(g(X)) = Fqd [X] is its root then all the roots of g(X) are , q , . . . , q .
d
Furthermore, d is the smallest positive integer such that q = .
d1
281
For a general irreducible polynomial, the notion of conjugacy is helpful: see Definition 3.1.34 below. This concept was introduced (and used) informally in Section
2.5 for fields F2s .
Definition 3.1.34
Elements , Fqn are called conjugate over Fq if
M ,Fq (X) = M ,Fq (X).
Summarising what was said above, we deduce the following assertion.
d1
power
of
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
1 + X + X4 1 + X3 + X4
vector
vector
(word)
(word)
0000
0000
1000
1000
0100
0100
0010
0010
0001
0001
1100
1001
0110
1101
0011
1111
1101
1110
1010
0111
0101
1010
1110
0101
0111
1011
1111
1100
1011
0110
1001
0011
(3.1.8)
282
Under the left table addition rule, the minimal polynomial M i (X) for the power
i is 1 + X + X 4 for i = 1, 2, 4, 8 and 1 + X 3 + X 4 for i = 7, 14, 13, 11, while for
i = 3, 6, 12, 9 it is 1 + X + X 2 + X 3 + X 4 and for i = 5, 10 it is 1 + X + X 2 . Under the
right table addition rule, we have to swap polynomials 1 + X + X 4 and 1 + X 3 + X 4 .
Polynomials 1 + X + X 4 and 1 + X 3 + X 4 are of order 15, polynomial 1 + X + X 2 +
X 3 + X 4 is of order 5 and 1 + X + X 2 of order 3.
4
A short way to produce these answers is to find the expression for i as a
2
3
linear combination of 1, i , i and i . For example, from the left table we
have for 7 :
7 4
= 28 = 3 + 2 + 1,
7 3
= 21 = 3 + 2 ,
3
4
and readily see that 7 = 1 + 7 , which yields 1 + X 3 + X 4 . For complete 2
ness, write down the unused expression for 7 :
7 2
= 14 = 12 2 = (1 + )3 2 = (1 + + 2 + 3 ) 2
= 2 + 3 + 4 + 5 = 2 + 3 + 1 + + (1 + ) = 1 + 3 .
283
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
vector
(word)
00000
10000
01000
00100
00010
00001
10100
01010
00101
10110
01011
10001
11100
01110
00111
10111
power
of
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
vector
(word)
11111
11011
11001
11000
01100
00110
00011
10101
11110
01111
10011
11101
11010
01101
10010
01001
(3.1.9)
Definition 3.1.38
An automorphism of Fqn over Fq (in short, an (Fqn , Fq )automorphism) is a bijection : Fqn Fqn with: (a) (a + b) = (a) + (b);
(b) (ab) = (a) (b); (c) (c) = c, for all a, b Fqn , c Fq .
Theorem 3.1.39 The set of (Fqn , Fq )-automorphisms is isomorphic to the cyclic
group Zn and generated by the Frobenius map q (a) = aq , a Fqn .
Proof Let Fqn be a primitive element. Then q 1 = e and M (X) Fq [X]
2
n1
has roots , q , q , . . . , q . An (Fqn ; Fq )-automorphism fixes the coefficients
j
of M (X), thus it permutes the roots, and ( ) = q for some j, 0 j n1. But
j
as is primitive, is completely determined by ( ). Then as q j ( ) = q =
( ), we have that = q j .
n
The rest of this section is devoted to a study of roots of unity, i.e. the roots of the
polynomial X n e over field Fq where q = ps and p =char (Fq ). Without loss of
generality, we suppose from now on that
gcd(n, q) = 1, i.e. n and q are co-prime.
(3.1.10)
Indeed, if n and q are not co-prime, we can write n = mpk . Then, by Lemma 3.1.5
k
X n e = X mp e = (X m e) p ,
and our analysis is reduced to the polynomial X m e.
284
285
Definition 3.1.46 The set of exponents i, iq, . . . , iqd1 where d(= d(i)) is the
minimal positive integer such that iqd = i mod n is called a cyclotomic coset (for i)
and denoted by Ci (= Ci (n, q)) (alternatively, C i is defined as the set of non-zero
d1
field elements i , iq , . . . , iq ).
Worked Example 3.1.47 Check that polynomials X 2 + X + 2 and X 3 + 2X 2 + 1
are primitive over F3 and compute the field tables for F9 and F27 generated by these
polynomials.
Solution The field F9 is isomorphic to F3 [X]/"X 2 + X + 2#. The multiplicative
powers of X are
2 2X + 1, 3 2X + 2, 4 2,
5 2X, 6 X + 2, 7 X + 1, 8 1.
The cyclotomic coset of is { , 3 } (as 9 = ). Then the minimal polynomial
M (X) = (X )(X 3 ) = X 2 ( + 3 )X + 4
= X 2 2X + 2 = X 2 + X + 2.
Hence, X 2 + X + 2 is primitive.
286
2 X 2 , 3 X 2 + 2, 4 X 2 + 2X + 2, 5 2X + 2,
6 2X 2 + 2X, 7 X 2 + 1, 8 X 2 + X + 2,
9 2X 2 + 2X + 2, 10 X 2 + 2X + 1, 11 X + 2,
12
X 2 + 2X, 13 2, 14 2X, 15 2X 2 , 16 2X 2 + 1,
17 2X 2 + X + 1, 18 X + 1, 19 X 2 + X,
20
2X 2 + 2, 21 2X 2 + 2X + 1, 22 X 2 + X + 1,
23 2X 2 + X + 2, 24 2X + 1, 25 2X 2 + X, 26 1.
The cyclotomic coset of in F27 is { , 3 , 9 }. Consequently, the primitive polynomial
M (X) = (X )(X 3 )(X 9 )
= X 3 ( + 3 + 9 )X 2 + ( 4 + 10 + 12 )X 13
= X 3 + 2X 2 + 1
as required.
Worked Example 3.1.48 (a) Consider the polynomial X 15 1 over F2 (with
n = 15, q = 2). Then = 2, s = ord15 (2) = 4 and Spl(X 15 1) = F24 = F16 .
The polynomial g(X) = 1 + X + X 4 is primitive: any of its roots are primitive
in F16 . So, the primitive (15, F2 )-root of unity is
= (2
4 1)/15
= .
This yields
X 9 1 = (1 + X)(1 + X + X 2 )(1 + X 3 + X 6 ).
287
|(qd 1),
f (X)| X e ,
|n iff f (X)| X n e ,
is the least positive integer such that f (X)| X e .
288
Proof (a) Spl( f (X)) = Fqd , hence every root of f (X) is a root of X q
it has ord( )|(qd 1).
d 1
e. So,
(b) Each root
of f (X) in Spl( f (X)) has ord( ) = and hence is a root of (X e).
So, f (X)| X e .
(c) If f (X)| X n e then each root of f (X) is a root of X n e, i.e. ord( )|n. So,
|n. Conversely, if n = k then (X e)|(X k e) and f (X)|(X n e) by (b).
289
degree d = ord (q), equals 2 if = d = 1, equals 0 in all other cases. In particular, the degree of an order irreducible polynomial always equals ord (q), i.e. the
minimal s such that qs = 1 mod . Here () is the Euler totient function.
The proofs of Theorems 3.1.52 and 3.1.53 are omitted (see [92]). We only
make a short comment about Theorem 3.1.53. If p(0) = 0, the order of irreducible
polynomial p(X) of degree d coincides with the order of any of its roots in the
multiplicative group Fqd . So, the order is iff d = ord (q) and p(X) divides the
so-called circular polynomial
Q (X) =
(X s ).
s:gcd(s,)=1
290
The smallest field Fq with this property (i.e. field Fq (1 , . . . , u )) is called a splitting field for p(X); we also say that p(X) splits over Fq (1 , . . . , u ). The splitting
field for p(X) is denoted by Spl(p(X)); an element Spl(p(X)) takes part in decomposition (3.1.13) iff p( ) = 0. Field Spl(p(X)) is described as the set {g( j )}
where j = 1, . . . , u, and g(X) Fq [X] are polynomials of degree < deg(p(X)). (ii)
Field Fq is splitting for the polynomial X q X. (iii) If polynomial p(X) of degree d isirreducible over Fq and is a root of p(X) in field Spl(p(X)) then Fqd
Fq [X] "p(X)# is isomorphic to Fq ( ) and all the roots of p(X) in Spl(p(X))
2
d1
are given by the conjugate elements , q , q , . . . , q . Thus, d is the smalld
est positive integer for which q = . (iv) Suppose that, for a given field Fq ,
a monic polynomial p(X) Fq [X] and an element from a larger field we have
p( ) = 0. Then there exists a unique minimal polynomial M (X) with the property
that M ( ) = 0 (i.e. such that any other polynomial p(X) with p( ) = 0 is divided
by M (X)). Polynomial M (X) is the unique irreducible polynomial over Fq vanishing at . It is also the unique polynomial of the minimum degree vanishing at .
We call M (X) the minimal polynomial of over Fq . If is a primitive element
of Fqd then M (X) is called a primitive polynomial for Fqd over Fq . We say that
elements , Fqd are conjugate over Fq if they have the same minimal polyd1
nomial over Fq . Then (v) the conjugates of Fqd over Fq are , q , . . . , q ,
d
where d is the smallest positive integer with q = . When = i where
is a primitive element, the congugacy class is associated with a cyclotomic coset
d1
C i = { i , iq , . . . , iq }.
Summary 1.58. Now assume that n and q = ps are co-prime and take polynomial
X n e. The roots of X n e in the splitting field Spl(X n e) are called nth roots
of unity over Fq . The set of all nth roots of unity is denoted by En . (i) Set En is
a cyclic subgroup of order n in the multiplicative group of field Spl(X n e). An
nth root of unity generating En is called a primitive nth root of unity. (ii) If Fqs is
Spl(X n e) then s is the smallest positive integer with n|(qs 1). (iii) Let n be the
set of primitive nth roots of unity over field Fq and n the set of primitive elements
of the splitting field Fqs = Spl(X s e). Then either n n = 0/ or n = n , the
latter happening iff n = qs 1.
291
(3.2.1)
X N e = X q1 e =
(X )
Fq
(as the splitting field Spl (X q X) is Fq ). Furthermore, owing to the fact that is a
primitive (q 1, Fq ) root of unity (or, equivalently, a primitive element of Fq ), the
minimal polynomial Mi (X) is just X i , for all i = 0, . . . , N 1.
An important property is that the RS codes are MDS. Indeed, the generator g(X)
of Xq,RS
, ,b has deg g(X) = 1. Hence, the rank k is given by
k = dim(Xq,RS
, ,b ) = N deg g(X) = N + 1.
(3.2.2)
By the generalised BCH bound (see Theorem 3.2.9 below), the minimal distance
d Xq,RS
, ,b = N k + 1.
But the Singleton bound states that d(X RS ) N k + 1. Hence,
RS
d(Xq,d,
,b ) = N k + 1 = .
(3.2.3)
292
Thus the RS codes have the largest possible minimal distance among all q-ary
codes of length q 1 and dimension k = q . Summarising, we obtain
Theorem 3.2.2
RS
Proof The proof is straightforward, as (Xq,RS
, ,b ) = Xq,q , ,b+ 1 .
0ik1
Lemma 3.2.5 Let a(X) = a0 + a1 X + + aN1 X N1 Fq [X] and be a primitive (N, Fq ) root of unity over Fq , N = q 1. Then
ai =
1
a( j ) i j .
N 0 jN1
1
1
1
a( j ) i j = N c( i ) = N c( Ni ),
N 0 jN1
(3.2.4)
293
Theorem 3.2.6
294
not known. Such an algorithm solution for the latter was found in 1969 by Elwyn Berlekamp and James Massey, and is known since as the BerlekampMassey
decoding algorithm (cf. [20]); see Section 3.3. Later on, other algorithms were
proposed: continued fraction algorithm and Euclidean algorithm (see [112]).
ReedSolomon codes played an important role in transmitting digital pictures
from American spacecraft throughout the 1970s and 1980s, often in combination
with other code constructions. These codes still figure prominently in modern space
missions although the advent of turbo-codes provides a much wider choice of coding and decoding procedures.
ReedSolomon codes are also a key component in compact disc and digital game
production. The encoding and decoding schemes employed here are capable of correcting bursts of up to 4000 errors (which makes about 2.5mm on the disc surface).
BCH
Definition 3.2.7 A BCH code Xq,N,
, ,b with parameters q, N, , and b is the
q-ary cyclic code XN = "g(X)# with length N, designed distance , such that its
generating polynomial is
(3.2.5)
g(X) = lcm M b (X), M b+1 (X), . . . , M b+ 2 (X) ,
i.e.
*
BCH
=
f (X) Fq [X] mod (X N 1) :
Xq,N,
, ,b
+
f ( b+i ) = 0, 0 i 2 .
If b = 1, this is a narrow sense BCH code. If is a primitive Nth root of unity, i.e. a
primitive root of the polynomial X N 1, the BCH code is called primitive. (Recall
that under condition gcd(q, N) = 1 these roots form a commutative multiplicative
group which is cyclic, of order N, and is a generator of this group.)
Lemma 3.2.8
BCH
The BCH code Xq,N,
, ,b has minimum distance .
Proof Without loss of generality consider a narrow sense code. Set the paritycheck ( 1) N matrix
2
...
N1
1
1 2
4
...
2(N1)
H = .
.
..
..
.
.
.
.
.
.
.
1
2(
1)
(
1)(N1)
...
1
The codewords of X are linear dependence relations between the columns of H.
Then Lemma 2.5.40 implies that any 1 columns of H are linearly independent.
In fact, select columns with top (row 1) entries k1 , . . . , k 1 where 0 k1 < <
k 1 N 1. They form a square ( 1) ( 1) matrix
295
k1 1
k2 1
...
k 1 1
k
k
k
k1 k1
2 2
...
1 k 1
D=
..
..
..
..
.
.
.
.
k1 k1 ( 2) k2 k2 ( 2) . . . k 1 k 1 ( 2)
that differs from the Vandermonde matrix by factors ks in front of the sth column.
Then the determinant of D is the product
1
1
...
1
k
k
k
1
2
...
1
1
det D = ks
..
..
..
..
.
s=1
.
.
.
k1 ( 2) k2 ( 2) . . . k 1 ( 2)
1
k
k
k
s
i
j
= 0,
=
i> j
s=1
and any 1 columns of H are indeed linearly independent. In turn, this means
that any non-zero codeword in X has weight at least . Thus, X has minimum
distance .
Theorem 3.2.9 (A generalisation of the BCH bound) Let be a primitive N th
root of unity and b 1, r 1 and > 2 integers, with gcd(r, N) = 1. Consider a
cyclic code X = "g(X)# of length N where g(X) is a monic polynomial of smallest degree with g( b ) = g( b+r ) = = g( b+( 2)r ) = 0. Prove that X has
d(X ) .
Proof As gcd(r, N) = 1, r is a primitive root of unity. So, we can repeat the
proof given above, with b replaced by bru where ru is found from ru + Nv = 1. An
alternative solution: the matrix N ( 1)
1
1
... 1
b
b+r
. . . b+( 2)r
2b
2(b+r)
. . . 2(b+( 2)r)
.
.
.
.
.
.
.
.
.
. .
.
(N1)b
(N1)b+r
(N1)(b+(
2)r)
...
checks the code X = "g(X)#. Take any of its ( 1) ( 1) submatrices, say,
with rows i1 < i2 < < i 1 . Denote it by D = (D jk ). Then
det D =
1l 1
1l 1
296
a( j )X n j .
aMS (X) =
(3.2.6)
j=1
Let q = 2 and a(X) F2 [X]/"X n 1#. Prove that the MattsonSolomon polynomial
aMS (X) is idempotent, i.e. aMS (X)2 = aMS (X) in F2 [X]/"X n 1#.
Solution Let a(X) =
0in1
3.2.5. In F2 , (nai )2 = nai , so aMS ( i )2 = aMS ( i ). For polynomials, write b(2) (X)
for the square in F2 [X] and b(X)2 for the square in F2 [X]/"X n 1#:
b(2) (X) = c(X)(X n 1) + b(X)2 .
Then
(2)
e e
... e
e
. . . n1
(2)
(aMS aMS ) .. ..
= 0.
..
.
.
. .
. .
e n1 . . . (n1)
( j i ) = 0,
0i< jn1
(2)
Vj =
i=0
i j vi , j = 0, . . . , N 1.
(3.2.7)
297
Lemma 3.2.12 (The inversion formula) The vector v is recovered from its Fourier
transform V by the formula
vi =
1 N1 i j
Vj.
N j=0
(3.2.8)
r j = 0 mod N.
j=0
r j = N mod p
j=0
N j=0
N k=0
j=0
N 1
a( j ) i j = N 1
0 jN1
=N
0 jN1
ak
0kN1
0 jN1
j(ki)
=N
ak jk i j
0kN1
ak N ki = ai .
0kN1
0 jN1
j =
( ) j = (e ( )N )(e )1 = 0.
0 jN1
Hence
ai =
1
a( j ) i j .
N 0 jN1
(3.2.9)
298
Worked Example 3.2.13 Give an alternative proof of the BCH bound: Let be a
primitive (N, Fq ) root of unity and b 1 and 2 integers. Let XN = "g(X)# be a
cyclic code where g(X) Fq [X]/"X N e# is a monic polynomial of smallest degree
having b , b+1 , . . . , b+ 2 among its roots. Then XN has minimum distance at
least .
Solution
Let a(X) =
0 jN1
a( i )X i =
0iN1
a( Ni )X i
0iN1
a( )X
i
Ni
1iN
a( j )X N j + 0 + + 0 (from b , . . . , b+ 2 )
1 jb1
+ a( b+ 1 )X Nb +1 + + a( N ).
(3.2.10)
+ a( b+ 1 )X N + + a( N )X b1
= X N p1 (X) + q(X)
= (X N e)p1 (X) + p1 (X) + q(X).
We see that cMS ( i ) = 0 iff p1 ( i ) + q( i ) = 0. But p1 (X) + q(X) is a polynomial
of degree N so it has at most N roots. Thus, cMS (X) has at most N
roots of the form i .
Therefore, the inversion formula (3.2.8) implies that the weight w(a(X)) (i.e. the
weight of the coefficient string a0 . . . aN1 ) obeys
w(a(X)) N the number of roots of cMS (X) of the form i .
(3.2.11)
That is,
w(a(X)) N (N ) = .
299
done it in their joint paper). For brevity, we take the value b = 1 (but will be able
to extend the definition to values of N > q 1).
Given N q, let S = {x1 , . . . , xN } Fq be a set of N distinct points in Fq (a
supporting set). Let Ev denote the evaluation map
Ev : f Fq [X] Ev( f ) = ( f (x1 ), . . . , f (xN )) FNq
(3.2.12)
and take
L = { f Fq [X] : deg f < k}.
(3.2.13)
Then the q-ary ReedSolomon code of length N and dimension k can be defined as
X = Ev(L);
(3.2.14)
B
d 1
erit has the minimum distance d = d(X ) = N k + 1 and corrects up to
2
rors. The encoding of a source message u = u0 . . . uk1 Fkq consists in calculating
the values of the polynomial f (X) = u0 + u1 X + + uk X k1 at points xi S.
Definition 3.2.1 (where X was defined as the set of polynomials c(X) =
cl X l Fq [X] with c( ) = c( 2 ) = = c( 1 ) = 0) emerges when
A
0l<q1
0l<N
N fl = c( Nl ), or N fNl1 = c( l+1 ), l = 0, . . . , N 1,
guaranteeing, in particular, that fk = = fN1 = 0.
Given f Fq [X] and y = y1 . . . yN FNq , set
dist ( f , y) =
1( f (xi ) = yi ).
1iN
B
d 1
. The above2
mentioned conventional decoding algorithms (the BerlekampMassey algorithm,
the continued fractions algorithm and the Euclidean algorithm) follow the same
principle: the algorithm either finds a unique f such that dist ( f , y) t or reports
that such f does not exist. On the other hand, given s > t, list decoding attempts to
find all f with dist ( f , y) s; the hope is that if we are lucky, the codeword with
this property will be unique, and we will be able to correct s errors, exceeding the
conventional limit of t errors.
Now assume y = y1 . . . yN is a received word and set t =
300
This idea goes back to Shannons bounded distance decoding: upon receiving
a word y, you inspect the Hamming balls around y until you encounter a closest
codeword (or a collection of closest codewords) to y. Of course, we want two
things: that (i) when we take s moderately larger than t, the chance of finding
two or more codewords within distance s is small, and (ii) the algorithm has a
reasonable computational complexity.
Example 3.2.14 The [32, 8] RS code over F32 has d = 25 and t = 12. If we take
s = 13, the Hamming ball about the received word y may contain two codewords.
However, assuming that all error vectors e of weight 13 are equally likely, the
probability of this event is 2.08437 1012 .
The GuruswamiSudan list decoding algorithm (see [59]) performs the task of
finding the codewords within distance s for t s tGS in a polynomial time. Here
L$
M
tGS = n 1
(k 1)n ,
and tGS can considerably exceed t.
In the above example, tGS = 17. Asymptotically, for RS codes of rate R, the
conventional decoding algorithms will correct
a fraction (1 R)/2 of errors, while
301
It is fruitful to think of X as the ideal in Fq [X]/"X N e# and consider all polynomials mod (X N e). Moreover, Fq [X]/"X N e# is a principal ideal ring: each
its ideal is of the form
"g(X)# = { f (X) : f (X) = g(X)h(X), h(X) Fq [X]/"X N e#}
(3.3.1)
g(X)|(X N e),
if deg g(X) = d then dim X = N d ,
X = { f (X) : f (X) = g(X)h(X), h(X) Fq [X], deg h(X) < N d},
if g(X) = g0 + g1 X + g2 X 2 + + gd X d , with gd = e, then g0 = 0 and
0 0 ... 0
g0 g1 g2 . . . gd
0 g0 g1 . . . gd1 gd 0 . . . 0
G=
...
...
0
...
g0
g1
. . . gd
is a generating matrix for X , with row i being the cyclic shift of row
i 1, i = 2, . . . , N d .
Conversely, for any polynomial g(X)|(X N e), the set "g(X)# =
{ f (X) : f (X) = g(X)h(X), h(X) Fq [X]/"X N e#} is an ideal in
Fq [X]/"X N e#, i.e. a cyclic code X , and the above properties (b)(d) hold.
Proof Take g(X) F2 [X] a non-zero polynomial of the least degree in X . Take
p(X) X and write
p(X) = q(X)g(X) + r(X), with deg r(X) < deg g(X).
Then r(X) mod (X N 1) belongs to X . This contradicts the choice of g(X) unless
r(X) = 0. Therefore, g(X)|p(x) which proves (i). Taking p(X) = X N 1 proves (ii).
Finally, if g(X) and g(X) both satisfy (i) and (ii) then g(X)|g(X) and g(X)|g(X),
implying g(X) = g(X).
Corollary 3.3.3 The cyclic codes of length N are in a one-to-one correspondence
with factors of X N e. In other words, the map
+
*
+
*
cyclic codes of length N
divisors of X N 1 ,
X g(X),
is a bijection.
302
303
1 0 1 1 1 0 0
0 1 0 1 1 1 0 .
0 0 1 0 1 1 1
The columns of this matrix are the non-zero elements of F32 . So, it is equivalent to
Hammings [7, 4] code.
The dual of Hammings [7, 4] code has generator polynomial X 4 + X 3 + X 2 + 1
(the reverse of h(X)). Since X 4 + X 3 + X 2 + 1 = (X + 1)g(X), it is a subcode of
Hammings [7, 4] code.
Worked Example 3.3.8 Let be a primitive N th root of unity. Let X = "g(X)#
be a cyclic code of length N . Show that the dimension dim (X ) equals the number
of powers j such that g( j ) = 0.
Solution Denote E(N) = { , 2 , . . . , N = e}, dim"g(X)# = N d, d = deg g(X).
But g(X) = (X i j ) where i1 , . . . , id are the zeros of "g(X)#. Hence, the
1 jd
304
hNr hNr1 . . . h1
h0
0 0 ... 0
0
h1
h0 0 . . . 0
hNr . . . . . .
,
H =
...
...
... ...
...
... ... ... ...
0
0
. . . hNr hNr1 . . . . . . . . . h0
(c) the dual code X is a cyclic code of dim X = r, and X = "g (X)#, where
Nr h(X 1 ) = h1 (h X Nr + h X Nr1 + + h
g (X) = h1
0
1
Nr ).
0 X
0
The generator g(X) of a cyclic code is specified, in terms of factorisation of
e, as a sub-product,
X N e = lcm M (X) : E(N) ,
(3.3.2)
XN
(3.3.3)
(3.3.4)
305
ij of length
space over Fq of dimension l, we associate ij with a (column) vector
01
0
2
T
H = .
..
0u
11 . . .
12 . . .
..
..
.
.
...
u
N1
1
N1
2
..
N1
u
(3.3.5)
can be considered as a parity-check matrix for the code with zeros 1 , . . . , u (with
the proviso that its rows may not be linearly independent).
Theorem 3.3.13 For q = 2, the Hamming [2l 1, 2l l 1, 3] code is equivalent
i
to a cyclic code "M (X)# = 0il1 (X 2 ) where is a primitive element in
F 2l .
Proof Let be a primitive (N, F2 ) root of unity where N = 2l 1. The splitting
field Spl (X N e) is F2l (as ordN (2) = l). So, is a primitive element in F2l .
l1
Take M (X) = (X )(X 2 ) X 2 , of degree l. The powers 0 =
e, , . . . , N1 form F2l , the list of the non-zero elements and the columns of the
l N matrix
0
H=
,
,...,
N1
(3.3.6)
consist of all non-zero binary vectors of length l. Hence, the Hamming [2l 1, 2l
l 1, 3] code is (equivalent to) the cyclic code "M (X)# whose zeros consist of a
primitive (2l 1; F2 ) root of unity and (necessarily) all the other roots of the
minimal polynomial for .
Theorem
3.3.14 If gcd(l, q 1) = 1 then the
l
l
q 1 q 1
,
l, 3 code is equivalent to the cyclic code.
q1 q1
q-ary
Hamming
306
1
Proof Write Spl(X N e) = Fql where l = ordN (q), N = qq1
. To justify the
l
q 1
= q 1 and l is the least positive integer with
selection of l observe that
N
l
1
> ql1 1.
this property as qq1
l
N=
ql 1
= 1 + + ql1 .
q1
(3.3.7)
g(X) = lcm M1 ,Fq (X), . . . , Mu ,Fq (X)(X)
(3.3.8)
307
"M (X)# and is equivalent to the Hamming code. We could try other possibilities
for zeros of X to see if it leads to interesting examples. This is the way to discover
the BCH codes [25], [70].
Recall the factorisation into minimal polynomials Mi (X)(= M i ,Fq (X)),
(3.3.9)
X N 1 = lcm Mi (X) : i = 0, . . . ,t ,
where is a primitive (N, Fq ) root of unity. The roots of Mi (X) are conjugate,
d1
i.e. have the form i , iq , . . . , iq where d(= d(i)) is the least integer 1 such
that iqd = i mod N. The set Ci = {i, iq, . . . , iqd1 } is the ith cyclotomic coset of q
mod N. So,
Mi (X) =
(X j ).
(3.3.10)
jCi
(3.3.11)
N ( 1) ordN (q).
e
...
b
b+1
...
b+ 2
T
2b
2(b+1)
2(b+
2)
.
H =
(3.3.12)
...
...
(N1)b
(N1)(b+1) . . .
(N1)(b+ 2)
The proper parity-check matrix H is obtained by removing redundant rows.
The binary BCH codes are simplest to deal with. Let Ci = {i, 2i, . . . , i2d1 } be
the ith cyclotomic coset (with d(= d(i)) being the smallest non-zero integer such
that i 2d = i mod N). Then u Ci iff 2u mod N Ci . So, Mi (X) = M2i (X), and for
all s 1 the polynomials
g2s1 (X) = g2s (X) = lcm{M1 (X), M2 (X), . . . , M2s (X)}.
308
BCH
BCH . So we can focus on the narrow
We immediately deduce that X2,N,2s+1
= X2,N,2s
sense BCH codes with odd designed distance = 2E + 1, and obtain an improvement of Theorem 3.3.16:
Theorem 3.3.17
BCH
The rank of a binary BCH code X2,N,2E+1
is N E ordN (2).
The problem of determining exactly the minimum distance of a BCH code has
been solved only partially (although a number of results exist in the literature). We
present the following theorem without proof.
Theorem 3.3.18 The minimum distance of a binary primitive narrow sense BCH
code is an odd number.
The previous results can be sharpened in a number of particular cases.
Worked Example 3.3.19
N(N 1) . . . (N i + 1) = S(E).
(3.3.14)
0iE+1
309
BCH
Corollary 3.3.21 If N = 2s 1 and s > 1 + log2 (E + 1)! then d(X2,2
s 1,2E+1 ) =
2E + 1. In particular, let N = 31 and s = 5. Then we easily verify that
5E
<
0iE+1
31
i
Set N = m, then
X N 1 = X m 1 = (X m 1)(1 + X m + + X ( 1)m ).
310
Theorem 3.3.25
There exists no infinite sequence of q-ary primitive BCH
codes XNBCH of length N such that d(XN )/N and rank(XN )/N are bounded away
from 0.
Decoding BCH codes can be done by using the so-called BerlekampMassey
algorithm. To begin with, consider a binary primitive narrow sense BCH code
BCH ) of length N = 2s 1 and designed distance 5. With E = 2 and
X BCH (= X2,N,5
N
sE
holds, and by Theorem 3.3.20, the distance
s 4, inequality 2 <
i
0iE+1
BCH
d X
equals 5. Thus, the code is two-error correcting. Also, by Theorem
3.3.17, the rank of X BCH is N 2s. [For s = 4, the rank is actually equal to
N 2s = 15 8 = 7.] So, X BCH is [2s 1, 2s 1 2s, 5].
The defining zeros are , 2 , 3 , 4 where is a primitive Nth root of unity
over F2 (which is also a primitive element of F2s ). We know that and 3
suffice as defining zeros: X BCH = {c(X) F2 [X]/"X N 1# : c( ) = c( 3 ) = 0}.
So, the parity-check matrix H in (3.3.12) can be taken in the form
e
2
N1
T
.
(3.3.16)
H =
e
3
6
3(N1)
It is instructive to compare the situation with the binary Hamming [2l 1, 2l
1l] code X ( H) . In the case of code X BCH , suppose again that a codeword c(X)
X was sent and the received word r(X) has 2 errors. Write r(X) = c(X) + e(X)
where the error polynomial e(X) now has weight 2. There are three cases to
consider: e(X) = 0, e(X) = X i or e(X) = X i + X j , 0 i = j N 1. If r( ) = r1
and r( 3 ) = r3 then e( ) = r1 and e( 3 ) = r3 . In the case of no error (e(X) = 0),
r1 = r3 = 0, and vice versa. In the single-error case (e(X) = X i ),
r3 = e( 3 ) = 3i = ( i )3 = (e( ))3 = r13 = 0.
Conversely, if r3 = r13 = 0 then e( 3 ) = e( )3 . If e(X) = X i + X j with i = j then
3i + 3 j = ( i + j )3 = 3i + 2i j + i 2 j + 3 j ,
i.e. 2i j + i 2 j = 0 or i + j = 0 which implies i = j, a contradiction. So, the
single error occurs iff r3 = r13 = 0, and the wrong digit is i such that r1 = i . So,
in the single-error case we identify a column of H, i.e. a pair ( i , 3i ) = (r1 , r3 )
and change digit i in r(X). This is completely similar to the decoding procedure for
Hamming codes.
In the two-error case (e(X) = X i + X j , i = j), in the spirit of the Hamming codes,
we try to find a pair of columns ( i , 3i ) and ( j , 3 j ) such that the sum ( i +
j , 3i + 3 j ) = (r1 , r3 ), i.e. solve the equation
r1 = i + j , r3 = 3i + 3 j .
311
Then find i, j such that y1 = i , y2 = j (y1 , y2 are called error locators). If such i,
j (or equivalently, error locators y1 , y2 ) are found, we know that errors occurred at
positions i and j.
It is convenient to introduce an error-locator polynomial (X) whose roots are
1
y1
1 , y2 :
(3.3.17)
where is the primitive element of F2s . [The rank of the code is N 2l and
for l 4 the distance equals 5, i.e. X is [2l 1, 2l 1 2l, 5] and corrects two
errors.] Assume that at most two errors occurred in a received word r(X) and let
r( ) = r1 , r( 3 ) = r3 . Then:
(a) if r1 = 0 then r3 = 0 and no error occurred;
(b) if r3 = r13 = 0 then a single error occurred at position i where r1 = i ;
(c) if r1 =
0 and r3 = r13 then two errors occurred: the error locator polynomial
(X) = 1 r1 X + (r3 r11 r12 )X 2 has two distinct roots N1i , N1 j and
the errors occurred at positions i and j.
For a binary BCH code with a general designed distance ( = 2t +1 is assumed
odd), we follow the same idea: compute
r1 = e( ), r3 = e( 3 ), . . . , r 2 = e( 2 )
for the received word r(X) = c(X) + e(X). Suppose that errors occurred at places
i1 , . . . , it . Then
e(X) =
1 jt
Xij.
312
i j = r1 ,
1 jt
3i j = r3 , . . . ,
1 jt
( 2)i j = r 2 ,
yj 2 = r 2 .
1 jt
y j = r1 ,
1 jt
y3j = r3 , . . . ,
1 jt
1 jt
(X) =
(1 y j X)
1 jt
i
has the roots y1
j . The coefficients i in (X) = i X can be determined from
0it
1
r2
r4
.
..
r2t4
r2t2
0
r1
r3
..
.
r2t5
r2t3
0
1
r2
..
.
..
.
...
0
0
r1
..
.
..
.
...
0
0
1
..
.
..
.
...
...
...
...
..
.
..
.
...
0
0
0
..
.
rt3
rt1
1
2
3
..
.
2t3
2t1
r1
r3
r5
..
.
r2t3
r2t1
313
(X) = 1 + 6 X + ( 13 + 12 )X 2 .
The roots of l(X) are 3 and 11 by the direct check. Hence we discover the errors
at the 4th and 12th positions.
Ak zk and WX (z) =
0kN
k
A
kz
(3.4.1)
0kN
N
1z
1
1 + (q 1)z WX
, z C,
(3.4.2)
WX (z) =
X
1 + (q 1)z
and takes a particularly elegant form in the binary case (q = 2):
1z
1
n
(1 + z) WX
WX (z) =
.
X
1+z
(3.4.3)
(3.4.4)
314
More generally, a linear representation D of a group G over a field F (not necessarily finite) is defined as a homomorphism
D : G GL(V ) : g D(g)
(3.4.5)
( j) : Fq S : u ju .
The character ( j) is non-trivial for j = 0. In fact, all characters of Fq can be described in this way, but we omit the proof of this assertion.
Next, we define a character of the group G = FNq . Fix a non-trivial onedimensional ordinary character : Fq S and a non-zero element v FNq and
define a character of the additive group G = FNq as follows:
(3.4.7)
(g) = 0.
(3.4.8)
gG
(h) (g) =
gG
(hg) = (g),
gG
gG
gG
315
Definition 3.4.3 The discrete Fourier transform (in short, DFT) of a function f
on FNq is defined by
f =
f (v)(v) .
(3.4.9)
vFN
q
Sometimes, the weight enumerator polynomial of code X is defined as a function of two formal variables x, y:
WX (x, y) =
xw(v) yNw(v)
(3.4.10)
vX
(if one sets x = z, y = 1, (3.4.10) coincides with (3.4.1)). So, we want to apply the
DFT to the function (no harm to say that x, y S )
g : FNq C [x, y] : v xw(v) yNw(v) .
Lemma 3.4.4
(3.4.11)
(3.4.12)
(3.4.13)
Then
Proof Let denote a non-trivial ordinary character of the additive group G = Fq .
Given Fq , set | | = 0 if = 0 and | | = 1 otherwise. Then for all u FNq we
compute
g(u) = "v, u# g(v)
vFN
q
"v, u# xw(v) yNw(v)
...
...
vFN
q
v0 Fq
vN1 Fq
N1
vi ui
i=0
N1
v0 Fq
vN1 Fq i=0
N1
gG
316
If ui = 0 then
gG
(gui )x = y (0)x = y x.
gG\0
f(x) = qk
f (y).
(3.4.14)
yX
xX
f(x) =
"v, x# f (v)
vFN
q xX
xX vFN
q
xX
"v, x# f (v)
vX xX
"v, x# f (v).
vFN
q \X xX
In the first sum we have "v, x# = (0) = 1 for all v X and all x X . In
the second sum we study the linear form
X Fq : x "v, x#.
Since v FNq \ X , this linear form is surjective, whence its kernel has dimension
k 1, i.e. for any g Fq there exist qk1 vectors x X such that "v, x# = g. This
implies
f(x) = qk
f (y) + qk1
f (y)
yX
xX
= qk
vFnq \X
f (v) (g)
gG
yX
(3.4.15)
Proof
317
g(v) = qk
vX
k
g(v)
vX
= q WX (y x, y + (q 1)x).
Substituting x = z, y = 1 we obtain (3.4.3).
Example 3.4.7 (i) For all codes X , WX (0) = A0 = 1 and WX (1) = X . When
N
X = FN
q , WX (z) = [1 + z(q 1)] .
(ii) For a binary repetition code X = {0000, 1111}, WX (x, y) = x4 + y4 . Hence,
WX (x, y) =
1
(y x)4 + (y + x)4 = y4 + 6x2 y2 + x4 .
2
(iii) Let X be the Hamming [7, 4] code. The dual code X has 8 codewords; all
except 0 are of weight 4. Hence, WX (x, y) = x7 + 7x4 y3 , and, by the MacWilliams
identity,
1
1
WX (x y, x + y) = 3 (x y)7 + 7(x y)4 (x + y)3
3
2
2
= x7 + 7x4 y3 + 7x3 y4 + y4 .
WX =
Hence, X has 7 words of weight 3 and 4 each. Together with the 0 and 1 words,
this accounts for all 16 words of the Hamming [7, 4] code.
Another way to derive the identity (3.4.1) is to use an abstract result related
to group algebras and character transforms for Hamming spaces FN
(which are
q
linear spaces over field Fq of dimension N). For brevity, the subscript q and superscript (N) will be often omitted.
Definition 3.4.8 The (complex) group algebra CFN for space FN is defined as
the linear space of complex functions G : x FN G(x) C equipped by a complex involution (conjugation) and multiplication. Thus, we have four operations for
functions G(x); addition and scalar (complex) multiplication are standard (pointwise), with (G + G )(x) = G(x) + G (x) and (aG)(x) = aG(x), G, G CFN ,
a C, x FN . The involution is just the (point-wise) complex conjugation:
G (x) = G(x) ; it is an idempotent operation, with G = G. However, the multiplication (denoted by ) is a convolution:
(G G )(x) =
G(y)G (x y), x FN .
(3.4.16)
yFN
This makes CFN a commutative ring and at the same time a (complex) linear
space, of dimension dim CFN = qN , with involution. (A set that is a commutative
318
ring and a linear space is called an algebra.) The natural basis in CFN is formed
by Diracs (or Kroneckers) delta-functions y , with y (x) = 1(x = y), x, y H .
If X FN is a linear code, we set GX (x) = 1(x X ).
The multiplication rule (3.4.16) requires an explanation. If we rewrite the
G(y)G(y ) (which makes the commuRHS in a symmetric form
y,y FN :y+y =x
tativity of the -multiplication obvious) then there will be an analogy with the
multiplication of polynomials. In fact, if A(t) = a0 + a1t + + al1t l1 and
A (t) = a 0 + a 1t + + a l 1t l 1 are two polynomials, with coefficient strings
(a0 , . . . , al1 ) and (a 0 , . . . , a l 1 ), then the product B(t) = A(t)A (t) has a string
am a m .
of coefficients (b0 , . . . , bl1+l 1 ) where bk =
m,m 0:m+m =k
From this point of view, rule (3.4.16) is behind some polynomial-type multiplication. Polynomials of degree n 1 form of course a (complex) linear space of
dimension n. However, they do not form a group (or even a semi-group). To make
a group, we should affiliate inverse monomials 1/t, 1/t 2 , and so on, and either consider infinite series or make an agreement that t n = 1 (i.e. treat t as an element of
a cyclic group, not a free variable). Similar constructions can be done for polynomials of several variables, but there we have a variety of possible agreements on
relations between variables.
Returning to our group algebra CH , we make the following steps:
(i) Produce a multiplicative version of the Hamming group H . That is, take a
collection of formal variables t (x) labelled by elements x H and postulate the
rule t (x)t (x ) = t (x+x ) for all x, x CH .
(ii) Then consider the set TH of all (complex) linear combinations G =
xH xt (x) and introduce (ii1) the addition G + G = xH (x + x )t (x) and (ii2)
the scalar multiplication aG = xH (ax )t (x) , G, G TH , a C. We again obtain a linear space of dimension qN , with the basis formed by basic combinations t (x) , x H . Obviously, TH and CH are isomorphic as linear spaces, with
G g.
(iii) Now remove brackets in t (x) (but keep the rule t xt x = t x+x ) and write
xH xt x as g(t) thinking that this is a function (in fact, a polynomial) of some
variable t obeying the above rule. Finally, consider the polynomial multiplication
g(t)g (t) in TH . Then TH and CH become isomorphic not only as linear spaces
but also as rings, i.e. as algebras.
The above construction is very powerful and can be used for any group, not just
for HN . Its power will be manifested in the derivation of the MacWilliams identity.
So, we will think of CH as a set of functions
g(t) =
xHn
xt x
(3.4.17)
319
t x;
(3.4.18)
xX
Xx (g)t x ,
(3.4.19a)
y (x y)
(3.4.19b)
xHn
where g (x , x Hn ) and
Xx (g) =
yHn
k=0
x:w(x)=k
0kn
Here
Ak =
x .
(3.4.21)
xH :w(x)=k
For a linear code X , with generating function gX (t) (see 3.4.18)), Ak gives the
number of codewords of weight k:
Ak = #{x X : w(x) = k}.
(3.4.22)
0kn
x: w(x)=k
320
where
A k =
Xx (g).
(3.4.24)
xH : w(x)=k
We have
Wg (s) = (1 + (q 1)s) Wg
n
1s
.
1 + (q 1)s
(3.4.25)
k=0
k=0
A k sk = Ak (1 s)k (1 + (q 1)s)nk
(3.4.26)
and expand:
n
(3.4.27)
i=0
j
i
j
(3.4.28)
j=0(i+kn)
0 (i + k n) = max [0, i + k n], i k = min [i, k].
Then
A k sk =
0kn
0kn
Ak
Ki (k)si =
0in
Ak Ki (k)si
0in 0kn
k
Ai Kk (i)s ,
0kn 0in
i.e.
A k =
Ai Kk (i).
(3.4.29)
0in
(3.4.30)
gX = #X gX .
(3.4.31)
Proof
By Lemma 3.4.2
321
Xu (gX ) = Xu
xX
(y u) = #X 1(u X ).
yX
Xx (gX )t x =
xH
= #X
#X 1(x X )t x
xH
xX
t x = #X gX (t).
Hence,
WgX (s) = #X WgX (s),
(3.4.32)
WX (s) =
Ak sk ,
WX (s) =
k=0
Ak sk
(3.4.33)
k=0
1
#X
Ai Kk (i),
(3.4.35)
0in
(3.4.36)
322
1
{(x, y) : x, y X , (x, y) = k},
M
k = 0, 1, . . . , N
(each pair x, y is counted two times). The numbers B0 , B1 , . . . , BN form the distance
distribution of code X . The expression
BX (s) =
Bk sk
(3.4.38)
0kN
X (s) =
sx .
(3.4.39)
(3.4.40)
xX
1
1
sx sy =
sxy
M xX yX
M x,yX
and hence
WhX (s) =
1
1 w(x y) = k sk = Bk sk
M 0kN x,yX :
0kN
= BX (s).
Now by the MacWilliams identity, for a given non-trivial character and the
corresponding transform , we obtain
Theorem 3.4.14 For hX (s) as above, if
hX (s) is the character transform and
Wh (s) its w-enumerator, with
X
k
Wh (s) = Bk s =
x (hX ) sk ,
X
0kN
0kN
w(x)=k
323
then
Bk =
Bi Kk (i),
0iN
x (hX (t)) =
and so,
Bk =
x (hX ) =
x:w(x)=k
1
|x (X )|2 0.
M w(x)=k
Thus:
Theorem 3.4.16
Bi Kk (i) 0.
(3.4.41)
0iN
Bi = M 2
0iN
or
Ei = M, with Ei =
0iN
1
Bi
M
(3.4.42)
Ei Kk (i) 0.
0iN
yFN
q :w(y)=k
"x,y# = Kk (i).
(3.4.43)
324
"x,y# =
...
yh1 Fq
yD
yhk Fq
j
= (q 1)k j
hi y
i=1 yFq
= (1) j (q 1)k j .
Hence,
N
M Bi Kk (i) =
"xy,z#
i=0
"x,z# |2 0.
zFN
q :w(z)=k xX
This leads us to the so-called linear programming (LP) bound stated in Theorem
3.4.17 below.
(The LP bound) The following inequality holds:
Theorem 3.4.17
0iN
and
0iN
325
Hence, for d even, as we can assume that E2i+1 = 0, the constraint in (3.4.44)
need only be considered for k = 0, . . . , [N/2].
Ei K0 (i) 0 follows from Ei 0.
0iN
M2 (N, d)
max
0iN
for k = 1, . . . ,
N
+ Ei Kk (i) 0
k
diN
(3.4.45)
A B
N
.
2
f (x) = 1 + f j K j (x)
j=1
(3.4.46)
326
j=d
f (0) = 1 + f j K j (0)
j=1
N
k=1
N
i=d
1 fk Bi (X )Kk (i)
N
= 1 Bi (X ) fk Kk (i)
i=d
N
k=1
= 1 Bi (X )( f (i) 1)
i=d
N
1 + Bi (X )
i=d
= M = Mq (N, d).
To obtain the Singleton bound select
Nd+1
f (x) = q
j=d
x
1
.
j
i=0
N i
N j
Ki (k) = q
N k
j
1
qN
f (i)Ki (k)
i=0
d1
N
N i
(k)/
K
N d +1 i
d 1
qd1 i=0
N k
N
=
/
0.
d 1
d 1
1
k=0
N k
N j
Kk (x) = q
N x
j
.
(3.4.47)
327
Ei
0iN
328
65/2
2
s 13s + 65/2
213
13
1 + 13 + +
5
for all s < 13 such that s2 13s + 65/2 > 0.
65 13
2
1 + 13 + 13 6 = 2.33277 106 :
21
not good enough. Next, s = 3 yields s2 13s + 65/2 = 9 39 + 65/2 = 5/2 > 0
and
M2 (13, 5)
13 11
65 13
212
2
2 :
1 + 13 + 13 6 + 13 2 5 = 13
5
111 66
not as good as Hammings. Finally, observe that 42 13 4 + 65/2 < 0, and the
procedure stops.
(3.5.1)
Then X is a [2m, m] linear code and has information rate 1/2. We can recover
from any non-zero codeword (a, b) X , as = ba1 (division in F2m ). Hence,
if = then X X = {0}.
Now, given = m (0, 1/2], we want to find = m such that code X has
minimum weight 2m . Since a non-zero binary (2m)-word can enter at most
one of the X s, we can find such if the number of the non-zero (2m)-words
329
of weight < 2m is < 2m 1, the number of distinct codes X . That is, we can
manage if
2m
< 2m 1
i
1i2m 1
2m
or even better,
< 2m 1. Now use the following:
i
1i2m
Lemma 3.5.2
For 0 1/2,
N
2N ( ) ,
k
0k N
(3.5.2)
(3.5.3)
Minimise the RHS of (3.5.3) in x = et for t > 0, i.e. for 0 < x < 1. This yields the
minimiser et = /(1 ) and the minimal value
N N
N
1
1+
1
2 1
1 N
1 N
= N N
= 2N ( )
,
2
2
with = 1 . Hence, (3.5.2) implies
N
N
1
N
N ( ) 1
.
k 2 2
2
0k N
330
= m =
1
1/2
log m
(3.5.4)
(with 0 < < 12 ), bound (3.5.4) becomes 2m2m/ log m < 2m 1 which is true for
m large enough. And m h1 (1/2) > 0, as m . Here and below, 1 is
the inverse function to (0, 1/2] ( ). In the code (3.5.1) with a fixed
the information rate is 1/2 but one cannot guarantee that d/2m is bounded away
from 0. Moreover, there is no effective way of finding a proper = m . However,
in 1972, Justensen [81] showed how to obtain a good sequence of codes cleverly
using the concatenation of words from an RS code.
More precisely, consider a binary (k1 k2 )-word a organised as k1 separate k2 words: a = a(0) a(1) . . . a(k1 1) . Pictorially,
k2
a =
k2
...
a(k1 1)
a(0)
a(i) F2k2 , 0 i k1 1.
We fix an [N1 , k1 , d1 ] code X1 over F2k2 called an outer code: X1 FN2k12 . Then
string a is encoded into a codeword c = c0 c1 . . . cN1 1 X1 . Next, each ci F2k2
is encoded by a codeword bi from an [N2 , k2 , d2 ] code X2 over F2 , called an inner
code. The result is a string b = b(0) . . . b(N1 1) FNq 1 N2 of length N1 N2 :
N2
N2
b =
...
b(0)
b(N1 1)
b(i) F2N2 , 0 i N1 1.
331
(3.5.6)
Ju
Ju
Ju
, x = 0 = d Xm,k
= min w(x) : x Xm,k
.
(3.5.7)
w Xm,k
For any fixed m, if the outer RS code XNRS , N = 2m 1, has minimum weight
Ju has
d then any super-codeword b = (c0 , c0 )(c1 , c1 ) . . . (cN1 , N1 cN1 ) Xm,k
d non-zero first components c0 , . . . , cN1 . Furthermore, any two inner codes
among X (0) , X (1) , . . . , X (N1) have only 0 in common. So, the corresponding d
ordered pairs, being from different codes, must be distinct. That is, super-codeword
b has d distinct non-zero binary (2m)-strings.
Ju is at least the sum of the weights
Next, the weight of super-codeword b Xm,k
of the above d distinct non-zero binary (2m)-strings. So, we need to establish a
lower bound on such a sum. Note that
k1
N(1 2R0 ).
d = N k+1 = N 1
N
Ju has at least N(1 2R ) distinct non-zero binary
Hence, a super-codeword b Xm,k
0
(2m)-strings.
Lemma 3.5.4 The sum of the weights of any N(1 2R0 ) distinct non-zero binary
(2m)-strings is
1 1
o(1) .
(3.5.8)
2mN(1 2R0 )
2
Proof By Lemma 3.5.2, for any [0, 1/2], the number of non-zero binary (2m)strings of weight 2m is
2m
22m ( ) .
i
1i2m
332
2m ( )
N(1 2R0 )
.
1
1
Select m =
(0, 1/2). Then m 1 (1/2), as h1 is
2 log(2m)
continuous on [0, 1/2]. So,
1
1
1
= 1
o(1).
m = 1
2 log(2m)
2
1
2m2m/ log(2m)
2m 1
m
1
2
0.
m
2m/
log(2m)
2 1 2
So the total weight of the N(1 2R0 ) distinct (2m)-strings is bounded below by
2mN(1 2R0 ) 1 (1/2) o(1) (1 o(1)) = 2mN(1 2R0 ) 1 (1/2) o(1) .
Thus the result follows.
Ju has
Lemma 3.5.4 demonstrates that Xm,k
Ju
1 1
o(1) .
w Xm,k 2mN(1 2R0 )
2
Then
(3.5.9)
Ju
w Xm,k
(1 2R0 )( 1 (1/2) o(1)) (1 2R0 ) 1 (1/2)
Ju
length Xm,k
c(1 2R0 ) > 0.
In the construction, R0 (0, 1/2). However, by truncating one can achieve any
given rate R0 (0, 1); see [110].
The next family of codes to be discussed in this section is formed by alternant
codes. Alternant codes are a generalisation of BCH (though in general not cyclic).
Like Justesen codes, alternant codes also form an asymptotically good family.
333
c11 . . .
..
..
M= .
.
cr1
c1n
.. .
.
. . . crn
m
As before, each ci j can be written as
c
i j (Fq ) , a column vector of length m over
Fq . That is, we can think of M as an (mr n) matrix over Fq (denoted again by M).
Given elements a1 , . . . , an Fqm , we have
a j ci j
c11 . . . c1n
a1
a1
1 jn
..
.. ..
.
..
.
.
.
M . = .
.
. . . =
.
an
cr1 . . . crn
an
a j cr j
1 jn
c =
c +
Furthermore, if b Fq and c, d Fqm then b
bc and
d = (c + d). Thus,
if a1 , . . . , an Fq , then
ai ci j
c11 . . . c1n
a1
a1
1 jn
.. ..
.
.
..
.
. . .. .. =
M . = .
.
.
an
an
cr1 . . . crn
ai cr j
1 jn
So, if the columns of M are linearly independent as r-vectors over Fqm , they are
also linearly independent as (rm)-vectors over Fq . That is, the columns of M are
linearly independent over Fq .
Recall that if is a primitive (n, Fqm ) root of unity and 2 then the n (m )
Vandermonde matrix over Fq
e
...
2
...
1
T
2
4
2(
1)
H =
...
...
n1 2(n1) . . . ( 1)(n1)
BCH (a proper parity-check matrix emerges
checks a narrow-sense BCH code Xq,n,
,
after column purging). Generalise it by taking an n r matrix over Fqm
h1 h1 1 . . . h1 1r2 h1 1r1
h2 h2 2 . . . h2 r2 h2 r1
2
2
(3.5.11)
A= .
,
..
..
.
.
.
.
.
.
.
hn hn n . . . hn nr2 hn nr1
334
h2 h2 2
A = .
..
.
..
hn hn n
. . . h1 1 r2
. . . h2 2 r2
..
..
.
.
r2
. . . hn n
r1
h1 1
r1
h2 2
.
r1
hn n
(3.5.12)
are linearly independent over Fqm . However, columns of A in (3.5.12) can be linearly dependent and purging such columns may be required to produce a genuine
parity-check matrix H.
Definition 3.5.5 Let = (1 , . . . , n ) and h = (h1 , . . . , hn ) where 1 , . . . , n are
distinct and h1 , . . . , hn non-zero elements of Fqm . Given r < n, an alternant code
XAlt
,h is the kernel of the n (rm) matrix A in (3.5.12).
(3.5.13)
335
(3.5.14a)
(3.5.14b)
bi (X i )1 Fqm [X]/"G(X)#.
(3.5.15)
1in
(3.5.16)
Clearly, XGo
,G is a linear code. The polynomial G(X) is called the Goppa polynomial; if G(X) is irreducible, we say that X Go is irreducible.
So, b = b1 . . . bn X Go iff in Fqm [X]
(3.5.17)
1in
Write G(X) = gi X i where deg G(X) = r, gr = 1 and r < n. Then in Fqm [X]
0ir
= 0 jr g j 0u j1 X u ij1u
336
and so
bi (G(X) G(i ))(X i )1 G(i )1
1in
= bi g j
1in
0 jr
0ur1
0u j1
ij1u X u G( j )1
X u bi G(i )1
1in
u+1 jr
g j ij1u .
1in
bi G(i )1
g j ij1u = 0
(3.5.18)
u+1 jr
for all u = 0, . . . , r 1.
Equation (3.5.18) leads to the parity-check matrix for X Go . First, we see that
the matrix
G(2 )1
...
G(n )1
G(1 )1
1 G(1 )1
2 G(2 )1 . . . n G(n )1
2 G(1 )1
22 G(2 )1 . . . n2 G(n )1
(3.5.19)
1
.
.
.
.
.
.
.
.
.
.
.
.
r1
r1
1
1
r1
1
1 G(1 )
2 G(2 )
. . . n G(n )
which is (n r) over Fm
q , provides a parity-check. As before, any r rows of matrix (3.5.19) are linearly independent over Fqm and so are its columns. Then again
we write (3.5.19) as an n (mr) matrix over Fq ; after purging linearly dependent
columns it will give the parity-check matrix H.
We see that X Go is an alternant code XAlt
,h where = (1 , . . . , n ) and h =
1
1
(G(1 ) , . . . , G(n ) ). So, Theorem 3.5.6 implies
Theorem 3.5.9 The q-ary Goppa code X = XGo
,G , where = {1 , . . . , n } and
deg G(X) = r < n, has length n, rank k satisfying n mr k n r and minimum
distance d(X ) r + 1.
As before, the above bound on minimum distance can be improved in the binary
case. Suppose that a binary word b = b1 . . . bn X where X is a Goppa code
XGo
,G , where F2m and G(X) F2 [X]. Suppose w(b) = w and bi1 = = biw = 1.
Take fb (X) = (X i j ) and write the derivative X fb (X) as
1 jw
where Rb (X) =
1 jw
have no common roots in any extension F2K , they are co-prime. Then b X Go
337
iff G(X) divides Rb (X) which is the same as G(X) divides X fb (X). For q = 2,
X fb (X) has only even powers of X (as its monomials are of the form X 1 times
a product of some i j s: this vanishes when is even). In other words, X fb =
h(X 2 ) = (h(X))2 for some polynomial h(X). Hence if g(X) is the polynomial of
lowest degree which is a square and divisible by G(X) then G(X) divides X fb (X)
iff g(X) divides X fb (X). So,
b X Go g(X)|X fb (X) Rb (X) = 0 mod g(X).
(3.5.21)
i1
over Fq obtained from the
in (3.5.12), we take the n (mr) matrix A = h j j
n r matrix A = h j i1
over Fqm by replacing the entries with rows of length m.
j
Then purge linearly dependent columns from A . Recall that h1 , . . . , hn are non-zero
and 1 , . . . , n are distinct elements of Fqm . Suppose a word u = c + e is received,
where c is the right codeword and e an error vector. We assume that r is even and
that t r/2 errors have occurred, at digits 1 i1 < < it n. Let the i j th entry
of e be ei j = 0. It is convenient to identify the error locators with elements i j : as
i = i for i = i (the i are distinct), we will know the erroneous positions if we
determine i1 , . . . , it . Moreover, if we introduce the error locator polynomial
t
(X) = (1 i j X) =
j=1
i X i ,
(3.5.22)
0it
, then it will be enough to find (X) (i.e. coeffiwith 0 = 1 and the roots i1
j
cients i ).
We have to calculate the syndrome (we will call it an A-syndrome) emerging by
acting on matrix A:
uA = eA = 0 . . . 0ei1 . . . eit 0 . . . 0 A.
338
(X) =
(1 i j X).
(3.5.23)
1 jt: j=k
1kt
Lemma 3.5.12
hik eik
si X i . It is convenient
0ir1
For all u = 1, . . . ,t ,
ei j =
(i1
)
j
hi j
1 jt, j= j
(1 i j i1
)
j
(3.5.24)
Proof Straightforward.
The crucial fact is that (X), (X) and s(X) are related by
Lemma 3.5.13
(3.5.25)
(X) (X)s(X) =
hik eik
1 jt: j=k
1kt
= hik eik
k
1 jt: j=k
(1 i j X) (X)
sl X l
0lr1
(1 i j X) (X)
l 1kt
= hik eik
k
(1 i X) (X) il X l
j
j=k
j=k
1 irk X r
= hik eik (1 i j X) 1 (1 ik X)
1 ik X
k
j=k
= hik ik (1 i j X) irk X r = 0 mod X r .
k
j=k
Lemma 3.5.13 shows the way of decoding alternant codes. We know that there
exists a polynomial q(X) such that
(3.5.26)
We also have deg (X) t 1 < r/2, deg (X) = t r/2 and that (X) and (X)
are co-prime as they have no common roots in any extension. Suppose we apply the
339
Euclid algorithm to the known polynomials f (X) = X r and g(X) = s(X) with the
aim to find (X) and (X). By Lemma 2.5.44, a typical step produces a remainder
rk (X) = ak (X)X r + bk (X)s(X).
(3.5.27)
If we want rk (X) and bk (X) to give (X) and (X), their degrees must match:
at least we must have deg rk (X) < r/2 and deg bk (X) r/2. So, the algorithm is
repeated until deg rk1 (X) r/2 and deg rk (X) < r/2. Then, according to Lemma
2.5.44, statement (3), deg bk (X) = deg X r deg rk1 (X) r r/2 = r/2. This is
possible as the algorithm can be iterated until rk (X) = gcd(X r , s(X)). But then
rk (X)| (X) and hence deg rk (X) deg (X) < r/2. So we can assume deg rk (X)
r/2, deg bk (X) r/2.
The relevant equations are
340
Then (X) is the error locator polynomial whose roots are the inverses of
i1 , . . . , yt = it , and i1 , . . . , it are the error digits. The values ei j are given by
ei j =
(i1
)
j
hi j l= j (1 il i1
)
j
(3.5.28)
341
q1 j = ( j )1 , j = b, . . . , b + 2.
That is, they are
qb , qb+1 , . . . , qb+q 1
and the generator polynomial g (X) for (X RS ) is
g (X) = (X b )(X b
+1
) . . . (X b
+q 1
342
(i) X = X ev or
(ii) X ev is an [N, k 1] linear subcode of X .
Prove that if the generating matrix G of X has no zero column then the total weight
w(x) equals N2k1 .
xX
and
A5 = N(N 1)(N 3)(N 7) 5!.
j
j=0s+2 2 +1
343
XH,
Ai Ks (i)
i=1
(3.6.1)
344
where
i
N i
(1) j ,
Ks (i) =
j
s
j
j=0s+iN
si
(3.6.2)
1
1)K (21 )
1
+
(2
s
2
21
2 1 21
s21
1
j
(1) .
= 1 + (2 1)
j
2 1 j
2
j=0s+21 2 +1
As =
345
346
So, and 3 suffice as zeros, and the generating polynomial g(X) equals
(X 5 + X 2 + 1)(X 5 + X 4 + X 3 + X 2 + 1)
= X 10 + X 9 + X 8 + X 6 + X 5 + X 3 + 1,
as required. In other words:
X = {c(X) F2 [X]/(X 31 + 1) : c( ) = c( 3 ) = 0}
= {c(X) F2 [X]/(X 31 + 1) : g(X)|c(X)}.
The rank of X equals 21. The minimum distance of X equals 5, its designed
distance. This follows from Theorem
3.3.20:
N
E
Let N = 2 1. If 2 <
then the binary narrow-sense primitive BCH
0iE+1 i
code of designed distance 2E + 1 has minimum distance 2E + 1.
In fact, N = 31 = 25 1 with = 5 and E = 2, i.e. 2E + 1 = 5, and
1024 = 210 < 1 + 31 +
31 30 31 30 29
+
= 4992.
2
23
Thus, X corrects two errors. The BerlekampMassey decoding procedure requires calculating the values of the received polynomial at the defining zeros. From
the F32 field table we have
u( ) = 12 + 11 + 9 + 7 + 6 + 2 + 1 = 3 ,
u( 3 ) = 36 + 33 + 27 + 18 + 6 + 1 = 9 .
So, u( 3 ) = u( )3 . As u( ) = 3 , we conclude that a single error occurred, at
digit three, i.e. u(X) is decoded by
c(X) = X 12 + X 11 + X 9 + X 7 + X 6 + X 3 + X 2 + 1
which is (X 2 + 1)g(X) as required.
Problem 3.4 Define the dual X of a linear [N, k] code of length N and dimension k with alphabet F. Prove or disprove that if X is a binary [N, (N 1)/2] code
with N odd then X is generated by a basis of X plus the word 1 . . . 1. Prove or
disprove that if a binary code X is self-dual, X = X , then N is even and the
word 1 . . . 1 belongs to X .
Prove that a binary self-dual linear [N, N/2] code X exists for each even N .
Conversely, prove that if a binary linear [N, k] code X is self-dual then k = N/2.
Give an example of a non-binary linear self-dual code.
Solution The dual X of the [N, k] linear code X is given by
X = {x = x1 . . . xN FN : x y = 0
for all y X }
347
1 0 0 0 0
0 1 0 0 0
X =
1 1 0 0 0 .
0 0 0 0 0
Then X is generated by
0 0 1 0 0
0 0 0 1 0 .
0 0 0 0 1
1 0 0 0 0 0 0 1 1 1 1 1
0 1 0 0 0 0 1 0 1 2 2 1
0 0 1 0 0 0 1 1 0 1 2 2
G=
0 0 0 1 0 0 1 2 1 0 1 2
0 0 0 0 1 0 1 2 2 1 0 1
0 0 0 0 0 1 1 1 2 2 1 0
Here rows of G are orthogonal (including self-orthogonal). Hence, X X .
But dim(X ) = dim(X ) = 6, so X = X .
Problem 3.5 Define a finite field Fq with q elements and prove that q must have
the form q = ps where p is a prime integer and s 1 a positive integer. Check that
p is the characteristic of Fq .
348
Prove that for any p and s as above there exists a finite field Fsp with ps elements,
and this field is unique up to isomorphism.
Prove that the set Fps of the non-zero elements of F ps is a cyclic group Z ps 1 .
Write the field table for F9 , identifying the powers i of a primitive element
F9 as vectors over F3 . Indicate all vectors in this table such that 4 = e.
Solution A field Fq with q elements is a set of cardinality q with two commutative
group operations, + and , with standard distributivity rules. It is easy to check that
char(Fq ) = p is a prime number. Then F p Fq and q = Fq = ps where s = [Fq : F p ]
is the dimension of Fq as a vector space over F p , a field of p elements.
Now, let Fq , the multiplicative group of non-zero elements from Fq , contain an
element of order q 1 = Fq . In fact, every b Fq has a finite order ord(b) = r(b);
set r0 = max[r(b) : b Fq ]. and fix a Fq with r(a) = r0 . Then r(b)|r0 for all
b Fq . Next, pick , a prime factor of r(b), and write r(b) = s , r0 = s . Let us
s
s
check that s s . Indeed, a has order , b order s and a b order s . Thus,
if s > s, we obtain an element of order > r0 . Hence, s s which holds for any
prime factor of r(b), and r(b)|r(a).
Then br(a) = e, for all b Fq , i.e. the polynomial X r0 e is divisible by (X b).
It must then be the product bFq (X b). Then r0 = Fq = q 1. Then Fq is a
cyclic group with generator a.
For each prime p and positive integer s there exists at most one field Fq with
q = ps , up to isomorphism. Indeed, if Fq and F q are two such fields then they both
are isomorphic to Spl(X q X), the splitting field of X q X (over F p , the basic
field).
The elements of F9 = F3 F3 with 4 = e are e = 01, 2 = 1 + 2 = 21,
4 = 02, 6 = 2 + = 12 where = 10.
Problem 3.6 Give the definition of a cyclic code of length N with alphabet Fq .
(N, Fq )-roots
What are the defining zeros of a cyclic code
why are they always
and
3s 1 3s 1
,
s, 3 code is equivalent
of unity? Prove that the ternary Hamming
2
2
to a cyclic code and identify the defining zeros of this cyclic code.
A sender uses the ternary [13, 10, 3] Hamming code, with field alphabet F3 =
{0, 1, 2} and the parity-check matrix H of the form
1 0 1 2 0 1 2 0 1 2 0 1 2
0 1 1 1 0 0 0 1 1 1 2 2 2 .
0 0 0 0 1 1 1 1 1 1 1 1 1
349
0mE
N
m
1
(q 1)
A
, with E =
B
d 1
,
2
(3.6.3)
shows that d = 3 and k = N l. So, the cyclic code with the parity-check matrix H
is equivalent to Hammings.
To decode the code in question, calculate the syndrome xH T = 2 0 2 = 2 (1 0 1)
indicating the error is in the 6th position. Hence, x 2e(6) = y + e(6) and the correct
word is c = 2 1 2 0 1 2 0 0 2 1 1 2 0.
Problem 3.7 Compute the rank and minimum distance of the cyclic code with
generator polynomial g(X) = X 3 +X +1 and parity-check polynomial h(X) = X 4 +
X 2 + X + 1. Now let be a root of g(X) in the field F8 . We receive the word
r(X) = X 5 + X 3 + X(mod X 7 1). Verify that r( ) = 4 , and hence decode r(X)
using minimum-distance decoding.
Solution A cyclic code X of length N has generator polynomial g(X) F2 [X]
and parity-check polynomial h(X) F2 [X] with g(X)h(X) = X N 1. Recall that
if g(X) has degree k, i.e. g(X) = a0 + a1 X + + ak X k where ak = 0, then
g(X), Xg(X), . . . , X nk1 g(X) form a basis for X . In particular, the rank of X
equals N k. In this question, k = 3 and rank(X ) = 4.
350
If h(X) = b0 + b1 X + + bNk X Nk
X is
...
b1
bNk bNk1
0
b
.
..
b
Nk
Nk1
..
..
..
0
.
.
.
...
0
b0
...
...
..
..
..
0
0
0
0
bNk bNk1 . . . b1 b0
;<
=
N
1 0 1 1 1 0 0
H = 0 1 0 1 1 1 0 .
0 0 1 0 1 1 1
no zero column
no codewords of weight 1
no repeated column no codewords of weight 2
Hence, d(X ) = 3. In fact, X is equivalent to Hammings original [7, 4] code.
Since g(X) F2 [X] is irreducible, the code X F72 = F2 [X] (X 7 1) is the
cyclic code defined by . The multiplicative cyclic group F8 Z
7 of non-zero
elements of field F8 is
0 = 1,
,
2,
3 = + 1,
4 = 2 + ,
5 = 3 + 2 = 2 + + 1,
6 = 3 + 2 + = 2 + 1,
7 = 3 + = 1.
Next, the value r( ) is
r( ) = + 3 + 5
= + ( + 1) + ( 2 + + 1)
= 2 +
= 4,
351
as required. Let c(X) = r(X)+X 4 mod(X 7 1). Then c( ) = 0, i.e. c(X) is a codeword. Since d(X ) = 3 the code is 1-error correcting. We just found a codeword
c(X) at distance 1 from r(X). Then r(X) is written as
c(X) = X + X 3 + X 4 + X 5 mod (X 7 1),
and should be decoded by c(X) under minimum-distance decoding.
Problem 3.8 If X is a linear [N, k] code, define its weight enumeration polynomial WX (s,t). Show that:
(a)
(b)
(c)
(d)
WX (1, 1) = 2k ,
WX (0, 1) = 1,
WX (1, 0) has value 0 or 1,
WX (s,t) = WX (t, s) if and only if WX (1, 0) = 1.
(3.6.4)
352
(3.6.7)
(3.6.8)
Problem 3.10 Let X be a linear code over F2 of length N and rank k and let Ai be
the number of words in X of weight i, i = 0, . . . , N . Define the weight enumerator
polynomial of X as
W (X , z) =
Ai zi .
0iN
(3.6.9)
[Hint: Consider g(u) = (1)uv zw(v) where w(v) denotes the weight of the
vFN
2
l1
(3.6.10)
l1
= 1 + (2l 1)z2 .
353
Solution The dual code X , of a linear code X with the generating matrix G and
the parity-check matrix H, is defined as a linear code with the generating matrix
H. If X is an [N, k] code, X is an [N, N k] code, and the parity-check matrix
for X is G.
Equivalently, X is the code which is formed by the linear subspace in FN2
orthogonal to X in the dot-product
"x, y# =
xi yi , x = x1 . . . xN , y = y1 . . . yN .
1iN
By definition,
W (X , z) =
zw(u) , W X , z =
zw(v) .
vX
uX
uX
(3.6.11)
zw(v) (1)"u,v#.
v
(3.6.12)
uX
Note that when v X , the sum (1)"u,v# = X . On the other hand, when
uX
v X then there exists u0 X such that "u0 , v# = 0 (i.e. "u0 , v# = 1). Hence, if
v X , then, with the change of variables u u + u0 , we obtain
uX
uX
(1)"u,v# = (1)"u,v#,
= (1)"u0 ,v#
uX
uX
(1)"u,v#
uX
(3.6.11) equals
1
X
vX
zw(v) X = W X , z .
(3.6.13)
zw(vi ) (1)ui vi
v1 ,...,vN 1iN
zw(a) (1)aui
1iN a=0,1
1iN
1 + z(1)ui .
(3.6.14)
354
Here w(a) = 0 for a = 0 and w(a) = 1 for a = 1. The RHS of (3.6.14) equals
(1 z)w(u) (1 + z)Nw(u) .
Hence, an alternative expression for (3.6.11) is
1
(1 + z)N
X
uX
1z
1+z
w(u)
=
1
1z
(1 + z)N W X ,
.
X
1+z
(3.6.15)
(3.6.16)
z)
(1
+
z)
i
dz X 0iN
z=1
1
N2N1 A1 (X )2N1 (only terms i = 0, 1 contribute)
=
X
2N N
(A1 (X ) = 0 as the code is at least 1-error correcting,
=
X 2
with distance 3).
Now take into account that
( X ) ( X ) = 2k 2Nk = 2N .
The equality
the average weight in X =
N
2
follows. The enumeration polynomial of the simplex code is obtained by substitution. In this case the average length is (2l 1)/2.
Problem 3.11 Describe the binary narrow-sense BCH code X of length 15 and
the designed distance 5 and find the generator polynomial. Decode the message
100000111000100.
355
Solution Take the binary narrow-sense BCH code X of length 15 and the designed
distance 5. We have Spl(X 15 1) = F24 = F16 . We know that X 4 + X + 1 is a
primitive polynomial over F16 . Let be a root of X 4 + X + 1. Then
M1 (X) = X 4 + X + 1, M3 (X) = X 4 + X 3 + X 2 + X + 1,
and the generator g(X) for X is
g(X) = M1 (X)M3 (X) = X 8 + X 7 + X 6 + X 4 + 1.
Take g(X) as example of a codeword. Introduce 2 errors at positions 4 and 12
by taking
u(X) = X 12 + X 8 + X 7 + X 6 + 1.
Using the field table for F16 , obtain
u1 = u( ) = 12 + 8 + 7 + 6 + 1 = 6
and
u3 = u( 3 ) = 36 + 24 + 18 + 1 = 9 + 3 + 1 = 4 .
As u1 = 0 and u31 = 18 = 3 = u3 , deduce that 2 errors occurred. Calculate the
locator polynomial
l(X) = 1 + 6 X + ( 13 + 12 )X 2 .
Substituting 1, , . . . , 14 into l(X), check that 3 and 11 are roots. This confirms
that, if exactly 2 errors occurred their positions are 4 and 12 then the codeword sent
was 100010111000000.
Problem 3.12 For a word x = x1 . . . xN FN2 the weight w(x) is the number
of non-zero digits: w(x) = {i : xi = 0}. For a linear [N, k] code X let Ai be the
number of words in X of weight i (0 i N). Define the weight enumerator
N
356
Problem 3.13 Let X be a binary linear [N, k, d] code, with the weight enumerator WX (s). Find expressions, in terms of WX (s), for the weight enumerators of:
(i) the subcode X ev X consisting of all codewords x X of even weight,
(ii) the parity-check extension X pc of X .
Prove that if d is even then there exists an [N, k, d] code where each codeword has
even weight.
Solution (i) All words with even weights from X belong to subcode X ev . Hence
ev
(s) =
WX
1
[WX (s) +WX (s)] .
2
357
Problem 3.16 Prove that the binary code of length 23 generated by the polynomial g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 has minimal distance 7, and is
perfect.
[Hint: If grev (X) = X 11 g(1/X) is the reversal of g(X) then
X 23 + 1 (X + 1)g(X)grev (X) mod 2.]
Solution First, show that the code is BCH, of designed distance 5. By the freshers
dream Lemma 3.1.5, if is a root of a polynomial f (X) F2 [X] then so is 2 .
Thus, if is a root of g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 then so are ,
2 , 4 , 8 , 16 , 9 , 18 , 13 , 3 , 6 , 12 . This yields the design sequence
358
359
Problem 3.17 Use the MacWilliams identity to prove that the weight distribution
of a q-ary MDS code of distance d is
N
j i
id+1 j
q
Ai =
(1)
1
i 0
j
jid
N
j i1
qid j , d i N.
=
(q 1) (1)
j
i
0 jid
[Hint: To begin the solution,
(a)
(b)
(c)
(d)
(e)
(3.6.18)
Use the fact that d(X ) = N k +1 and d(X ) = k +1 and obtain simplified equations involving ANk+1 , . . . , ANr only. Subsequently, determine ANk+1 , . . . , ANr .
Varying r, continue up to AN .]
Solution The MacWilliams identity is
1iN
i
A
i s =
Ni
1
i
A
(1
s)
.
1
+
(q
1)s
i
qk 1iN
r
N r
qk 0iNr
q 0ir
(the Leibniz rule (3.6.18) is used here). Formula (3.6.19) is the starting point. For
an MDS code, A0 = A
0 = 1, and
Ai = 0, 1 i N k (= d 1), A
i = 0, 1 i k (= d 1).
Then
1 N
1 Nr
1
N i
N
N 1
Ai = r
= r
,
+
r
r qk qk i=Nk+1
q N r
q r
360
i.e.
N
N i
(qkr 1).
Ai =
r
r
i=Nk+1
Nr
(3.6.20)
N
k1
+ ANk+2 =
(q2 1),
A
k2
k 2 Nk+1
361
Solution Write
i
N i
Kk (i) =
(1) j (q 1)k j .
j
k
j
0(i+kN) jki
Next:
(a) The following straightforward equation holds true:
i N
k N
Kk (i) = (q 1)
Ki (k)
(q 1)
i
k
(as all summands become insensitive to swapping i k).
N
N
For q = 2 this yields
Ki (k); in particular,
Kk (i) =
k
i
N
N
K0 (i) =
Ki (0) = Ki (0).
i
0
(b) Also, for q = 2: Kk (i) = (1)k Kk (N i) (again straightforward, after swapping
i i j).
N
N
K2i (k) which equals
(c) Thus, still for q = 2:
K (2i) =
k
2i k
N
N
2i
(1)
K (2i). That is,
K2i (N k) =
2i Nk
N k
Kk (2i) = KNk (2i).
Problem 3.19 What is an (n, Fq )-root of unity? Show that the set E(n,q) of the
(n, Fq )-roots of unity form a cyclic group. Check that the order of E(n,q) equals n if
n and q are co-prime. Find the minimal s such that E(n,q) Fqs .
Define a primitive (n, Fq )-root of unity. Determine the number of primitive
(n, Fq )-roots of unity when n and q are co-prime. If is a primitive (n, Fq )-root of
unity, find the minimal such that Fq .
Find representation of all elements of F9 as vectors over F3 . Find all (4, F9 )-roots
of unity as vectors over F3 .
Solution We know that any root of an irreducible polynomial of degree 2 over field
F3 = {0, 1, 2} belongs to F9 . Take the polynomial f (X) = X 2 + 1 and denote its
root by (any of the two). Then all elements of F9 may be represented as a0 + a1
where a0 , a1 F3 . In fact,
F9 = {0, 1, , 1 + , 2 + , 2 , 1 + 2 , 2 + 2 }.
362
= 1 + , 2 = 2 , 3 = 1 + 2 , 4 = 2,
5 = 2 + 2 , 6 = , 7 = 2 + , 8 = 1.
Hence, the roots of degree 4 are 2 , 4 , 6 , 8 .
Problem 3.20 Define a cyclic code of length N over the field Fq . Show that there
is a bijection between the cyclic codes of length N , and the factors of X N e in the
polynomial ring Fq [X].
Now consider binary cyclic codes. If N is an odd integer then we can find a finite
extension K of F2 that contains a primitive N th root of unity . Show that a cyclic
code of length N with defining set { , 2 , . . . , 1 } has minimum distance at
least . Show that if N = 2 1 and = 3 then we obtain the Hamming [2 1,
2 1 , 3] code.
Solution A linear code X FN
is a cyclic code if x1 . . . xN X implies that
q
x2 , . . . xN x1 X . Bijection of cyclic codes and factors of X N 1 can be established
as in Corollary 3.3.3.
Passing to binary codes, consider, for brevity, N = 7. Factorising in F72 renders
the decomposition
X 7 1 = (X 1)(X 3 + X + 1)(X 3 + X 2 + 1) := (X 1) f1 (X) f2 (X).
Suppose is a root of f1 (X). Since f1 (X)2 = f1 (X 2 ) in F2 [X] we have
f1 ( ) = f1 ( 2 ) = 0.
It follows that the cyclic code X with defining root has the generator polynomial
f1 (X) and the check polynomial (X 1) f2 (X) = X 4 + X 2 + X + 1. This property
363
si + s j )
si + s j = z
(3.6.22)
for any pair (z, z ) that may eventually occur as a syndrome under two errors.
A natural guess is to try a permutation that has some algebraic significance,
e.g. si = si si = (si )2 (a bad choice) or si = si si si = (si )3 (a good choice)
364
(1 . . . 00) (1 . . . 00)k
..
HT =
.
.
(1 . . . 11) (1 . . . 11)k
Then we have to deal with equations of the type
si + s j = z, ski + skj = z .
(3.6.23)
For solving (3.6.23), we need the field structure of the Hamming space, i.e. not
only multiplication but also division. Any field structure on the Hamming space
of length
N is isomorphic to F2N , and a concrete realisation of such a structure is
F2 [X] "c(X)#, a polynomial field modulo an irreducible polynomial c(X) of degree
N. Such a polynomial always exists: it is one of the primitive polynomials of degree
N. In fact, the simplest consistent system of the form (3.6.23) is
s + s = z, s3 + s = z ;
3
(3.6.24)
where s and s are words of length 4 (or equivalently their polynomials), and the
multiplication is mod 1 + X + X 4 . In the case of two errors it is guaranteed that
there is exactly one pair of solutions to (3.6.24), one vector occupying position i
and another position j, among the columns of the upper (Hamming) half of matrix
H. Moreover, (3.6.24) cannot have more than one pair of solutions because
z = s3 + s = (s + s )(s2 + ss + s ) = z(z2 + ss )
3
implies that
ss = z z1 + z2 .
(3.6.25)
365
Now (3.6.25) and the first equation in (3.6.24) give that s, s are precisely the roots
of a quadratic equation
X 2 + zX + z z1 + z2 = 0
(3.6.26)
(with z z1 + z2 = 0). But the polynomial in the LHS of (3.6.26) cannot have more
than two distinct roots (it could have no root or two coinciding roots, but it is
excluded by the assumption that there are precisely two errors). In the case of a
single error, we have z = z3 ; in this case s = z is the only root and we just find the
word z among the columns of the upper half of matrix H.
Summarising, the decoding scheme, in the case of the above [15, 7] code, is as
follows: Upon receiving word y, form a syndrome yH T = (z, z )T . Then
(i) If both z and z are zero words, conclude that no error occurred and decode y
by y itself.
(ii) If z = 0 and z3 = z , conclude that a single error occurred and find the location
of the error digit by identifying word z among the columns of the Hamming
check matrix.
(iii) If z = 0 and z3 = z , form the quadric (3.6.24), and if it has two distinct roots
s and s , conclude that two errors occurred and locate the error digits by identifying words s and s among the columns of the Hamming check matrix.
(iv) If z = 0 and z3 = z and quadric (3.6.26) has no roots, or if z is zero but z is
not, conclude that there are at least three errors.
Note that the case where z = 0, z3 = z and quadric (3.6.26) has a single root is
impossible: if (3.6.26) has a root, s say, then either another root s = s or z = 0 and
a single error occurs.
The decoding procedure allows us to detect, in some cases, that more than three
errors occurred. However, this procedure may lead to a wrong codeword when
three or more errors occur.
4
Further Topics from Information Theory
367
MAGC is particularly attractive because it allows one to do some handy and farreaching calculations with elegant answers.
However, Gaussian (and other continuously distributed) channels present a challenge that was absent in the case of finite alphabets considered in Chapter 1.
Namely, because codewords (or, using a slightly more appropriate term, codevectors) can a priori take values from a Euclidean space (as well as noise vectors),
the definition of the channel capacity has to be modified, by introducing a power
constraint. More generally, the value of capacity for a channel will depend upon
the so-called regional constraints which can generate analytic difficulties. In the
case of MAGC, the way was shown by Shannon, but it took some years to make
his analysis rigorous.
An input word of length N (designed to use the channel over N slots in succession) is identified with an input N-vector
x1
x(= x(N) ) = ... .
xN
We assume that xi R and hence x(N) RN (to make the notation shorter, the upper
index (N) will be often omitted).
In anadditive
channels an input vector x is transformed to a random vector
Y1
Y(N) = ... where Y = x + Z, or, component-wise,
YN
Y j = x j + Z j , 1 j N.
(4.1.1)
Z1
Z = ...
ZN
is a noise vector composed of random variables Z1 , . . . , ZN . Thus, the noise can be
characterised by a joint PDF f no (z) 0 where
z1
..
z= .
zN
368
no
Example 4.1.1
An additivechannel
is called Gaussian (an AGC, in short) if,
Z1
for each N, the noise vector ... is a multivariate normal; cf. PSE I, p. 114.
ZN
We assume from now on that the mean value EZ j = 0. Recall that the multivariate
normal distribution with the zero mean is completely determined by its covariance
matrix. More precisely, the joint PDF fZno(N) (z(N) ) for an AGC has the form
z1
1
1 T 1
..
N
(4.1.2)
1/2 exp 2 z z , z = . R .
N/2
(2 )
det
zN
Here is an N N matrix assumed
to be real, symmetric and strictly positive definite, with entries j j = E Z j Z j representing the covariance of noise random
variables Z j and Z j , 1 j, j N. (Real strict positive definiteness means that is
of the form BBT where B is an N N real invertible matrix; if is strictly positive
definite then has N mutually orthogonal eigenvectors, and all N eigenvalues of
are greater than 0.) In particular, each random variable Z j is normal: Z j N(0, 2j )
where 2j = EZ 2j coincides with the diagonal entry j j . (Due to strict positive definiteness, j j > 0 for all j = 1, . . . , N.)
If in addition the random variables Z1 , Z2 , . . . are IID, the channel is called memoryless Gaussian (MGC) or a channel with (additive) Gaussian white noise. In this
case matrix is diagonal: i j = 0 when i = j and ii > 0 when i = j. This is an
important model example (both educationally and practically) since it admits some
nice final formulas and serves as a basis for further generalisations.
Thus, an MGC has IID noise random variables Zi N(0, 2 ) where 2 =
independence is equivalent to decorreVarZi = EZi2 . For normal random
variables,
lation. That is, the equality E Z j Z j = 0 for all j, j = 1, . . . , N with j = j implies
that the components Z1 , . . . , ZN of the noise vector Z(N) are mutually independent.
This can be deduced from (4.1.2): if matrix has j j = 0 for j = j then is
diagonal, with det = j j , and the joint PDF in (4.1.2) decomposes into a
1 jN
369
Moreover, under the IID assumption, with j j 2 > 0, all random variables Z j
N(0, 2 ), and the noise distribution for an MGC is completely specified by a single
parameter > 0. More precisely, the joint PDF from (4.1.3) is rewritten as
N
1
1
exp 2 z2j .
2 1 jN
2
It is often convenient to think that an infinite random sequence Z
1 = {Z1 , Z2 , . . .}
(N)
is given, and the above noise vector Z is formed by the first N members of this
sequence. In the Gaussian case, Z
1 is called a random Gaussian process; with
EZ j
0, this
process
is determined, like before, by its covariance , with i j =
Cov Zi , Z j = E Zi Z j . The term white Gaussian noise distinguishes this model
from a more general model of a channel with coloured noise; see below.
Channels with continuously distributed noise are analysed by using a scheme
similar to the one adopted in the discrete case: in particular, if the channel is used
for transmitting one of M 2RN , R < 1, encoded messages, we need a codebook
that consists of M codewords of length N: xT (i) = (x1 (i), . . . , xN (i)), 1 i M:
x1 (M)
x1 (1)
(
)
xN (1)
xN (M)
The codebook is, of course, presumed to be known to both the sender and the
receiver. The transmission rate R is given by
R=
log2 M
.
N
(4.1.5)
Now suppose thata codevectorx(i) had been sent. Then the received random
x1 (i) + Z1
..
vector Y(= Y(i)) =
is decoded by using a chosen decoder d : y
.
xN (i) + ZN
d(y) XM,N . Geometrically, the decoder looks for the nearest codeword x(k),
relative to a certain distance (adapted to the decoder); for instance, if we choose to
use the Euclidean distance then vector Y is decoded by the codeword minimising
the sum of squares:
d(Y) = arg min
(4.1.6)
1 jN
when d(y) = x(i) we have an error. Luckily, the choice of a decoder is conveniently
resolved on the basis of the maximum-likelihood principle; see below.
370
There is an additional subtlety here: one assumes that, for an input word x to get a
chance of successful decoding, it should belong to a certain transmittable domain
in RN . For example, working with an MAGC, one imposes the power constraint
1
x2j
N 1
jN
(4.1.7)
where > 0 is a given constant. In the context of wireless transmission this means
that the amplitude square power per signal in an N-long input vector should be
bounded by , otherwise the result of transmission is treated as undecodable.
Geometrically, in order to perform decoding, the inputcodeword x(i) constituting
the codebook must lie inside the Euclidean ball BN2 ( N) of radius r = N
centred at 0 RN :
1/2
x1
..
(N)
2
x
r
B2 (r) = x = . :
.
j
1 jN
xN
The subscript 2 stresses that RN with the standard Euclidean distance is viewed as
a Hilbert 2 -space.
In fact, it is not required that the whole codebook XM,N lies in a decodable
domain; the agreement is only that if a codeword x(i) falls outside then it is decoded
wrongly with probability 1. Pictorially, the requirement is that most of codewords
lie within BN2 ((N )1/2 ) but not necessarily all of them. See Figure 4.1.
A reason for the regional constraint (4.1.7) is that otherwise the codewords
can be positioned in space at an arbitrarily large distance from each other, and,
eventually, every transmission rate would become reliable. (This would mean that
the capacity of the channel is infinite; although such channels should not be dismissed outright, in the context of an AGC the case of an infinite capacity seems
impractical.)
Typically, the decodable region D(N) RN is represented by a ball in RN , centred
at the origin, and specified relative to a particular distance in RN . Say, in the case
of exponentially distributed noise it is natural to select
x1
..
(N)
(N)
D = B1 (N ) = x = . : |x j | N
1 jN
xN
the ball in the 1 -metric. When an output-signal vector falling within distance r
from a codeword is decoded by this codeword, we have a correct decoding if (i)
the output signal falls in exactly one sphere around a codeword, (ii) the codeword
in question lies within D(N) , and (iii) this specific codeword was sent. We have
possibly an error when more than one codeword falls into the sphere.
371
Figure 4.1
(N)
(4.1.8)
As before, N = 1, 2, . . . indicates how many slots of the channel were used for transmission, and we will consider the limit N . Now assume that the distribution
(N)
(N)
Pch ( | x(N) ) is determined by a PMF fch (y(N) | x(N) ) relative to a fixed measure
(N) on RN :
(N)
Pch (Y(N)
A| x
(N)
)=
(N)
(4.1.9a)
(N) = (N times);
(4.1.9b)
for instance, (N) can be the Lebesgue measure on RN which is the product of
Lebesgue measures on R: dx(N) = dx1 dxN . In the discrete case where
digits xi represent letters from an input channel alphabet A (say, binary, with
A = {0, 1}), is the counting measure on A , assigning weight 1 to each symbol of the alphabet. Then (N) is the counting measure on A N , the set of all input
words of length N, assigning weight 1 to each such word.
372
fch (y j |x j ).
(4.1.10)
1 jN
Here fch (y|x) is the symbol-to-symbol channel PMF describing the impact of a
single use of the channel. For an MGC, fch (y|x) is a normal N(x, 2 ). In other
words, fch (y|x) gives the PDF of a random variable Y = x + Z where Z N(0, 2 )
represents the white noise affecting an individual input value x.
Next, we turn to a codebook XM,N , the image of a one-to-one map M RN
where M is a finite collection of messages (originally written in a message alphabet); cf. (4.1.4). As in the discrete case, the ML decoder dML decodes the received
(N)
word Y = y(N) by maximising fch (y| x) in the argument x = x(N) XM,N :
(N)
(4.1.11)
dML (y) = arg max fch (y| x) : x XM,N .
The case when maximiser is not unique will be treated as an error.
(N),
Another useful example is the joint typicality (JT) decoder dJT = dJT (see
below); it looks for the codeword x such that x and y lie in the -typical set TN :
dJT (y) = x if x XM,N and (x, y) TN .
(4.1.12)
The JT decoder is designed via a specific form of set TN for codes generated as
samples of a random code X M,N . Consequently, for given output vector yN and a
code XM,N , the decoded word dJT (y) XM,N may be not uniquely defined (or not
defined at all), again leading to an error. A general decoder should be understood
as a one-to-one map defined on a set K(N) RN taking points yN KN to points
x XM,N ; outside set K(N) it may be not defined correctly. The decodable region
K(N) is a part of the specification of decoder d (N) . In any case, we want to achieve
(N)
(N)
Pch d (N) (Y) = x|x sent = Pch Y K(N) |x sent
(N)
+ Pch Y K(N) , d(Y) = x|x sent 0
as N . In the case of an MGC, for any code XM,N , the ML decoder from
(4.1.6) is defined uniquely almost everywhere in RN (but does not necessarily give
the right answer).
We also require that the input vector x(N) D(N) RN and when x(N) D(N) ,
the result of transmission is rendered undecodable (regardless of the qualities of the
decoder used). Then the average probability of error, while using codebook XM,N
and decoder d (N) , is defined by
eav (XM,N , d (N) , D(N) ) =
1
e(x, d (N), D(N) ),
M xX
M,N
(4.1.13a)
373
(4.1.13b)
Here e(x, d (N) , D(N) ) is the probability of error when codeword x had been transmitted:
1,
x D(N) ,
(4.1.14)
e(x, d (N) , D(N) ) =
(N) (Y) = x|x , x D(N) .
P(N)
ch d
In (4.1.14) the order of the codewords in the codebook XM,N does not matter;
thus XM,N may be regarded simply as a set of M points in the Euclidean space RN .
Geometrically, we want the points of XM,N to be positioned so as to maximise the
chance of correct ML-decoding and lying, as a rule, within domain D(N) (which
again leads us to a sphere-packing problem).
To 3this end,
4 suppose that a number R > 0 is fixed, the size of the codebook XM,N :
M = 2NR . We want to define a reliable transmission rate as N in a fashion
similar to how it was done in Section 1.4.
Definition 4.1.2 Value R >30 is 4called a reliable transmission rate with regional
constraint D(N) if, with M = 2NR , there exist a sequence {XM,N } of codebooks
XM,N RN and a sequence {d (N) } of decoders d (N) : RN RN such that
lim eav (XM,N , d (N) , D(N) ) = 0.
(4.1.15)
Remark 4.1.3 It is easy to verify that a transmission rate R reliable in the sense of
average error-probability eav (XM,N , d (N) , D(N) ) is reliable for the maximum errorprobability emax (XM,N , d (N) , D(N) ). In fact, assume that R is reliable in the sense of
Definition 4.1.2, i.e. in the sense of the average error-probability. Take a sequence
{XM,N } of the corresponding codebooks with M = 2RN and a sequence {dN } of
(0)
the corresponding decoding rules. Divide each code XN into two halves, XN and
(1)
XN , by ordering the codewords in the non-decreasing order of their probabilities
(0)
of erroneous decoding and listing the first M (0) = M/2 codewords in XN and
(1)
(0)
the rest, M (1) = M M (0) , in XN . Then, for the sequence of codes {XM,N }:
(i) the information rate approaches the value R as N as
1
log M (0) R + O(N 1 );
N
(ii) the maximum error-probability, while using the decoding rule dN ,
1
M
(0)
Pemax XN , dN (1) Pe (x(N) , dN ) (1) Peav (XN , dN ) .
M
M
(1)
(N)
x
XN
374
C = sup R > 0 : R is reliable ;
(4.1.16)
it varies from channel to channel and with the shape of constraining domains.
It turns out (cf. Theorem 4.1.9 below) that for the MGC, under the average power
constraint threshold (see (4.1.7)), the channel capacity C( , 2 ) is given by the
following elegant expression:
C( , 2 ) =
1
log2 1 + 2 .
2
(4.1.17)
Furthermore, like in Section 1.4, the capacity C( , 2 ) is achieved by a sequence of random codings where codeword x(i) = (X1 (i), . . . , XN (i)) has IID
components X j (i) N(0, N ), j = 1, . . . , N, i = 1, . . . , M, with N 0 as
N . Although such random codings do not formally obey the constraint
(4.1.7) for finite N, it is violated with a vanishing
probability as N (since
1
2
X j (i) : 1 i M = 1 with a proper choice
lim supN P max
N 1
jN
of N ). Consequently, the average error-probability (4.1.13a) goes to 0 (of course,
for a random coding the error-probability becomes itself random).
Example 4.1.4 Next, we discuss an AGC with coloured Gaussian noise. Let a
codevector x = (x1 , . . . , xN ) have multi-dimensional entries
x j1
..
x j = . Rk ,
1 j N,
x jk
and the components Z j of the noise vector
Z1
Z = ...
ZN
375
Z j1
Z j = ... .
Z jk
N 1 jN
(4.1.18)
The formula for the capacity of an AGC with coloured noise is, not surprisingly,
more complicated. As Q = Q, matrices and Q may be simultaneously diagonalised. Let i and i , i = 1, . . . , k, be the eigenvalues of and Q, respectively
(corresponding to the same eigenvectors). Then
(l1 l )+
1
C( , Q, ) =
,
(4.1.19)
log2 1 +
2 1lk
l
1
where (l1 l )+ = max
,
0
. In other words, (l1 l )+ are the
l
l
1
eigenvalues of the matrix Q + representing the positive-definite part of the
Hermitian matrix Q1 . Next, = ( ) > 0 is determined from the condition
(4.1.20)
tr I Q + = .
The positive-definite part I Q + is in turn defined by
I Q + = + I Q +
where + is the orthoprojection (in Rk ) onto the subspace
spanned by
the eigenvectors of Q with eigenvalues l l < . In (4.1.20) tr I Q + 0 (since
tr AB 0 for all pair of positive-definite matrices), equals 0 for = 0 (as
376
I(X : Y ) = E log
(N)
I(X
(N )
:Y
) = E log
(4.1.21a)
Here fX(N) (x(N) ) and fY(N ) (y(N ) ) are the marginal PMFs for X(N) and Y(N ) (i.e.
joint PMFs for components of these vectors).
377
1,
x(N) D(N) ,
E(x(N) , D(N) ) =
(N) ) = x(N) |x(N) , x(N) D(N) ;
P(N)
ch dML (Y
cf. (4.1.14). Furthermore, we are interested in the expected value
(4.1.21b)
Next, given > 0, we can define the supremum of the mutual information per
signal (i.e. per a single use of the channel), over all input probability distributions
PX(N) with E (PX(N) , D(N) ) :
1
sup I(X(N) : Y(N) ) : E (PX(N) , D(N) ) ,
N
C = lim sup C ,N C = lim inf C .
C ,N =
(4.1.22)
(4.1.23)
We want to stress that the supremum in (4.1.22) should be taken over all probability distributions PX(N) of the input word X(N) with the property that the expected
error-probability is , regardless of whether these distributions are discrete or
continuous or mixed (contain both parts). This makes the correct evaluation of
CN, quite difficult. However, the limiting value C is more amenable, at least in
some important examples.
We are now in a position to prove the converse part of the Shannon second
coding theorem:
Theorem 4.1.5 (cf. Theorems 1.4.14 and
a channel given by
2.2.10.) Consider
(N)
a sequence of probability distributions Pch | x sent for the random output
words Y(N) and decodable domains D(N) . Then quantity C from (4.1.22), (4.1.23)
gives an upper bound for the capacity:
C C.
(4.1.24)
Proof Let R be a reliable transmission rate and {XM,N } be a sequence of codebooks with M = XM,N 2NR for which lim eav (XM,N , D(N) ) = 0. Consider the
N
(N)
pair (x, dML (y)) where (i) x = xeq is the random input word equidistributed over
XM,N , (ii) Y = Y(N) is the received word and (iii) dML (y) is the codeword guessed
while using the ML decoding rule dML after transmission. Words x and dML (Y) run
378
jointly over XM,N , i.e. have a discrete-type joint distribution. Then, by the generalised Fano inequality (1.2.23),
hdiscr (X|d(Y)) 1 + log(M 1)
XXM,N
1+
NR
Pch (dML (Y) = x|x sent)
M xX
M,N
(N)
1 + h(Xeq )
N
1
1
(N)
(N)
= I(Xeq : d(Y(N) )) + h(Xeq |d(Y(N) ))
N
N
1 (N) (N)
1
+ I(xeq : Y ) + N .
N
N
For any given > 0, for N sufficiently large, the average error-probability will
satisfy eav (XM,N , D(N) ) < . Consequently, R C ,N , for N large enough. (Because
eav (XM,N , D(N) ) gives a specific
the equidistribution over a codebook XM,N with
example of an input distribution PX(N) with E PX(N) , D(N) ) .) Thus, for all > 0,
R C , implying that the transition rate R C. Therefore, C C, as claimed.
R
The bound C C in (4.1.24) becomes exact (with C = C) in many interesting situations. Moreover, the expression for C simplifies in some cases of interest. For example, for an MAGC instead of maximising the mutual information I(X(N) : Y(N) )
for varying N it becomes possible to maximise I(X : Y ), the mutual information between single input and output signals subject to an appropriate constraint. Namely,
for an MAGC,
C = C = sup I(X : Y ) : EX 2 < .
(4.1.25a)
379
1 jN
h(Y j ) h(Z(N) )
h(Y j ) h(Z j ) .
(4.1.26)
1 jN
0
0
log2 e
1
log2 2 ( 2j + 2 ) +
EY 2
2
2( 2j + 2 ) j
1
log2 2 e( 2j + 2 ) ,
2
=
and consequently,
I(X j : Y j ) = h(Y j ) h(Z j )
log2 [2 e( 2j + 2 )] log2 (2 e 2 )
2j
= log2 1 + 2 ,
380
numbers, that lim PX(N) B(N) ( N ) = 1. Moreover, for any input probability
The bound
1 jN
N
jN
The Jensen inequality, applied to the concave function x log2 (1 + x), implies
2j
2j
1
1
1
log2 1 + 2 log2 1 +
2
2N 1
2
N 1
jN
jN
1
log2 1 + 2 .
2
After establishing Theorem 4.1.8, we will be able to deduce that the capacity
C( , 2 ) equals the RHS, confirming the answer in (4.1.17).
Example 4.1.7
repeated:
For the coloured Gaussian noise the bound from (4.1.26) can be
I(X(N) : Y(N) )
1 jN
Here we work with the mixed second-order moments for the random vectors of
input and output signals X j and Y j = X j + Z j :
1
2j .
N 1
jN
In this calculation we again made use of the fact that X j and Z j are independent
and the expected value EZ j = 0.
1
Next, as in the scalar case, I(X(N) : Y(N) ) does not exceed the difference
N
h(Y ) h(Z) where Z N(0, ) is the coloured noise vector and Y = X + Z is
a multivariate normal distribution maximising the differential entropy under the
trace restriction. Formally:
1
I(X(N) : Y(N) ) h( , Q, ) h(Z)
N
381
1
max log (2 )k e det(K + ) :
2
+
K positive-definite k k matrix with tr (QK) .
h( , Q, ) =
Write in the diagonal form = CCT where C is an orthogonal and the diagonal k k matrix formed by the eigenvalues of :
1 0 . . . 0
0 2 . . . 0
=
.
.
.
0 0
. 0
0
. . . k
1ik
eigenbasis), and B11 = 1 , . . . , Bkk = k are the eigenvalues of K. As before, assume Q = Q, then tr (B) = i i . So, we want to maximise the product
1ik
1ik
1ik
i i .
1ik
log(i + i ) +
1ik
i i
1ik
is maximised at
1
1
= i , i.e. i =
i , i = 1, . . . , k.
i + i
i
To satisfy the regional constraint, we take
1
i =
i
, i = 1, . . . , k,
i
+
and adjust > 0 so that
1ik
1
i i
= .
+
(4.1.28)
382
(4.1.29)
where the RHS comes from (4.1.28) with = 1/ . Again, we will show that the
capacity C( , Q, ) equals the last expression, confirming the answer in (4.1.19).
We now pass to the direct part of the second Shannon coding theorem for general
channels with regional restrictions. Although the statement of this theorem differs
from that of Theorems 1.4.15 and 2.2.1 only in the assumption of constraints upon
the codewords (and the proof below is a mere repetition of that of Theorem 1.4.15),
it is useful to put it in the formal context.
Theorem 4.1.8 Let a channel be specified by a sequence of conditional probabili
(N)
ties Pch | x(N) sent for the received word Y(N) and a sequence of decoding con
(N)
straints x(N) D(N) for the input vector. Suppose that probability Pch | x(N) sent
is given by a PMF fch (y(N) |x(N) sent) relative to a reference measure (N) . Given
c > 0, suppose that there exists a sequence of input probability distributions PX(N)
such that
(i) lim PX(N) (D(N) ) = 1,
N
(ii) the distribution PX(N) is given by a PMF fX(N) (x(N) ) relative to a reference
measure (N) ,
(iii) the following convergence in probability holds true: for all > 0,
N
limN PX(N) ,Y(N) T = 1,
(N) , Y(N) )
1
f
(N) ,Y(N) (x
x
c ,
TN = log+
N
fX(N) (x(N) ) fY(N) (Y(N) )
(4.1.30a)
where
(N)
(N)
(N) (N)
fX(N) ,Y(N) (x(N)
0 , y ) = fX(N) (x ) fch (y |x sent),
fY(N) (y(N) ) = fX(N) (xN ) fch (y(N) |x(N) sent) N dx(N) .
(4.1.30b)
383
(random) word YN = YN ( j) received, with the joint PMF fX(N) ,Y(N) as in (4.1.30b).
We take > 0 and decode YN by using joint typicality:
dJT (YN ) = xN (i) when xN (i) is the only vector among
xN (1), . . . , xN (M) such that (xN (i), YN ) TN .
Here set TN is specified in (4.1.30a).
Suppose a random vector xN ( j) has been sent. It is assumed that an error occurs
every time when
(i) xN ( j) D(N) , or
(ii) the pair (xN ( j), YN ) TN , or
(iii) (xN (i), YN ) TN for some i = j.
These possibilities do not exclude each other but if none of them occurs then
(a) xN ( j) D(N) and
(b) x( j) is the only word among xN (1), . . . , xN (M) with (xN ( j), YN ) TN .
Therefore, the JT decoder will return the correct result. Consider the average errorprobability
1
E( j, PN )
EM (PN ) =
M 1
jM
where E( j, PN ) is the probability that any of the above possibilities (i)(iii) occurs:
*
+ *
+
E( j, PN ) = P xN ( j) D(N) (xN ( j), YN ) TN
+
*
(xN (i), YN ) TN for some i = j
= E1 xN ( j) D(N)
+ E1 xN ( j) D(N) , dJT (YN ) = xN ( j) . (4.1.31)
The symbols P and E in (4.1.31) refer to (1) a collection of IID input vectors
xN (1), . . . , xN (M), and (2) the output vector YN related to xN ( j) by the action of
the channel. Consequently, YN is independent of vectors xN (i) with i = j. It is instructive to represent the corresponding probability distribution P as the Cartesian
product; e.g. for j = 1 we refer in (4.1.31) to
P = PxN (1),YN (1) PxN (2) PxN (M)
where PxN (1),YN (1) stands for the joint distribution of the input vector xN (1) and the
output vector YN (1), determined by the joint PMF
fxN (1),YN (1) (xN , yN ) = fxN (1) (xN ) fch (yN |xN sent).
384
Thanks to the condition that lim Px(N) (D(N) ) = 1, the first summand vanishes as
N
N
(x (i), YN ) TN = 2NR 1 P (xN (2), YN ) TN .
i=2
N
(x (i), YN ) TN 2N(Rc+3 )
i=2
Em (P ) = EPxN (1) PxN (M)
N
1
E( j)
M 1
jM
385
(N),
where e(x) = e(x, XM,N , D(N) , dJT ) is the error-probability for the input word x
in code XM,N , under the JT decoder and with regional constraint specified by D(N) :
1,
xN D(N) ,
e(x) =
(N),
Pch dJT
(YN ) = x|x sent , xN D(N) .
Hence, R is a reliable transmission rate in the sense of Definition 4.1.2. This completes the proof of Theorem 4.1.8.
We also have proved in passing the following result.
Theorem 4.1.9 Assume that the conditions of Theorem 4.1.5 hold true. Then,
for all R < C, there exists a sequence of codes XM,N of length N and size M 2RN
such that the maximum probability of error tends to 0 as N .
Example 4.1.10 Theorem 4.1.8 enables us to specify the expressions in (4.1.17)
and (4.1.19) as the true values of the corresponding capacities (under the ML rule):
for a scalar white noise of variance 2 , under an average input power constraint
x2j N ,
1 jN
C( , 2 ) =
1
log 1 + 2 ,
2
for a vector white noise with variances 2 = (12 , . . . , k2 ), under the constraint
xTj x j N ,
1 jN
C( , 2 ) =
1
( i2 )+
log
1
+
, where
2 1ik
i2
( i2 )+ = 2 ,
1ik
and for the coloured vector noise with a covariance matrix , under the constraint
xTj Qx j N ,
1 jN
(i1 i )+
1
,
C( , Q, ) =
log 1 +
2 1ik
i
where ( i i )+ = .
1ik
Explicitly, for a scalar white noise we take the random coding where the signals
X j (i), 1 j N, 1 i M = 2NR , are IID N(0, ). We have to check the
conditions of Theorem 4.1.5 in this case: as N ,
386
N =
1
P(X,Y )
.
log
N 1
P
X (X)PY (Y )
jM
Next, (ii): since pairs (X j ,Y j ) are IID, we apply the law of large numbers and
obtain that
P(X,Y )
= I(X1 : Y1 ).
N E log
PX (X)PY (Y )
But
I(X1 : Y1 ) = h(Y1 ) h(Y1 |X1 )
1
1
= log 2 e( + 2 ) log 2 e 2
2
2
1
C( , 2 ) as 0.
= log 1 +
2
2
Hence, the capacity equals C( , 2 ), as claimed. The case of coloured noise is
studied similarly.
Remark 4.1.11 Introducing a regional constraint described by a domain D does
not mean one has to secure that the whole code X should lie in D. To guarantee
that the error-probability Peav (X ) 0 we only have to secure that the majority
of codewords x(i) X belong to D when the codeword-length N .
Example 4.1.12
noise vector
Z1
Z = ...
ZN
387
has two-side exponential IID components Z j (2) Exp( ), with the PDF
1
fZ j (z) = e |z| , < z < ,
2
where Exp denotes the exponential distribution, > 0 and E|Z j | = 1/ (see PSE I,
Appendix). Again we will calculate the capacity under the ML rule and with a
regional constraint x(N) (N ) where
(
)
(N ) = x(N) RN : |x j | N .
1 jN
First, observe that if the random variable X has E|X| and the random variable
Z has E|Z| then E|X + Z| + . Next, we use the fact that a random variable
Y with PDF fY and E|Y | has the differential entropy
h(Y ) 2 + log2 ; with equality iff Y (2) Exp(1/ ).
In fact, as before, by Gibbs
h(Y ) =
0
0
1
fY (y)|y|dy + log
= 1 + log 2 + log
1
+ E|Y |
= 1+
1
2 + log2 ( j + 1 ) 2 + log2 ( )
N
1
= log2 1 + j
N
log2 1 + .
The same arguments as before establish that the RHS gives the capacity of the
channel.
388
M = 13
m=6
inf
= ln 7
2b
_
Figure 4.2
(4.1.32)
389
(4.1.33)
where the supremum is taken over all finite partitions and of intervals [A, A]
and [A b, A + b], and X and Y stand for the quantised versions of random
variables X and Y , respectively.
In other words, the input-signal distribution PX with
PX (A) = PX (A + 2b) = = PX (A 2b) = PX (A) =
1
m+1
(4.1.34)
maximises I(X : Y ) under assumptions (i) and (ii). We denote this distribution by
(A,A/m)
(bm,b)
, or, equivalently, PX
.
PX
However, if M = 2m, i.e. the number A of allowed signals is even, the calculation becomes more involved. Here, clearly, the uniform distribution U(A
b, A + b) for the output signal Y cannot be achieved. We have to maximise h(Y ) =
h(X + Z) within the class of piece-wise constant PDFs fY on [A b, A + b]; see
below.
Equal spacing in [A, A] is generated by points A/(2m 1), 3A/(2m
1), . . . , A; they are described by the formula (2k 1)A/(2m 1) for k =
1, . . . , m. These points divide the interval [A, A] into (2m 1) intervals of length
2A/(2m 1). With Z U(b, b) and A = b(m 1/2), we again have the outputsignal PDF fY (y) supported in [A b, A + b]:
pm /(2b),
for k = 1, . . . , m 1,
fY (y) =
p1 + p1 (2b), if b/2 y b/2,
for k = 1, . . . , m + 1,
p /(2b),
if b(m + 1/2) y b(m 1/2),
m
390
where
1
(2k 1)A
pk = pX b k
=P X =
, k = 1, . . . , m,
2
2m 1
stand for the input-signal probabilities. The entropy h(Y ) = h(X + Z) is written as
pk + pk+1 pk + pk+1 p1 + p1 p1 + p1
pm pm
ln
ln
ln
2
2b 1k<m
2
2b
2
2b
pk + pk+1 pk + pk+1 pm pm
ln
ln
.
2
2b
2
2b
m<k1
1km
L (PX ; ) reads
L (PX ; ) = 0, k = 1, . . . , m.
pk
Thus, we have m equations, with the same RHS:
pm (pm1 + pm )
2 + 2 = 0, (implies) pm (pm1 + pm ) = 4b2 e2 2 ,
4b2
(pk1 + pk )(pk + pk+1 )
2 + 2 = 0,
ln
4b2
(implies) (pk1 + pk )(pk + pk+1 ) = 4b2 e2 2 , 1 < k < m,
2p1 (p1 + p2 )
2 + 2 = 0 (implies) 2p1 (p1 + p2 ) = 4b2 e2 2 .
ln
4b2
ln
This yields
pm = pm1 + pm2 = = p3 + p2 = 2p1 ,
pm + pm1 = pm2 + pm3 = = p2 + p1 ,
K
for m even,
391
and
pm = pm1 + pm2 = = p2 + p1 ,
pm + pm1 = pm2 + pm3 = = p3 + p2 = 2p1 ,
K
for m odd.
1/(4b), A y 3A,
fY (y) = 1/(2b), A y A,
yielding Cinf = (ln 2)/2.
1/(4b), 3A y A,
For M = 4 (four input signals at A, A/3, A/3, A, with b = 2A/3): p1 = 1/6,
p2 = 1/3, and the maximising output-signal PDF is
392
whence
m+2
p1 ,
m
m2
p3 =
p1 ,
m
m+4
p4 =
p1 ,
m
m4
p1 ,
p5 =
m
..
.
2m 2
,
pm2 =
m
2
pm1 = ,
m
p2 =
(4.1.36)
pm = 2p1 ,
with p1 =
1
.
2(m + 1)
1
1
inf
ln 2.
and CA
= ln
2 4m(m + 1)
(4.1.37)
On the other hand, for a general odd m, the maximising input-signal distribution
PX has
m+1
,
2m(m + 1)
m1
,
p2 =
2m(m + 1)
m+3
p3 =
,
2m(m + 1)
m3
p4 =
,
2m(m + 1)
..
.
1
,
pm1 =
2m(m + 1)
m
.
pm =
m(m + 1)
p1 =
(4.1.38)
393
This yields the same answer for the maximum entropy and the restricted capacity:
1
1
h(Y ) = ln
2 4m(m + 1)b2
1
1
inf
ln 2.
and CA
= ln
2 4m(m + 1)
(4.1.39)
(4.1.40)
Then
1 p
1 2p
1
p
+ (A b)(1 2p) ln
hy(Y ) = Ap ln + (2b A)(1 p) ln
,
b
2b
2b
2b
(4.1.41)
and the equation dh(Y ) dp = 0 is equivalent to
pA = (1 p)2bA (1 2p)2(Ab) .
(4.1.42)
(4.1.43a)
394
We are interested in the solution lying in (0, 1/2) (in fact, in (1/3, 1/2)). For b =
3A/4, the equation becomes
pA = (1 p)A/2 (1 2p)A/2 ,
i.e.
whence p = (3 5) 2.
p2 = (1 p)(1 2p),
(4.1.43b)
Example 4.1.16 It is useful to look at the example where the noise random
variable Z has two components: discrete and continuous. To start with, one could
try the case where
fZ (z) = q0 + (1 q) (z; 2 ),
i.e. Z = 0 with probability q and Z N(0, 2 ) with probability 1 q (0, 1). (So,
1 q gives the total probability of error.) Here, we consider the case
fZ = q0 + (1 q)
1
1(|z| b),
2b
(4.1.44a)
p1 , p0 , p1 0, p1 + p0 + p1 = 1,
(4.1.44b)
where
with b = A and M = 3 (three signal levels in (A, A)). The input-signal entropy is
h(X) = h(p1 , p0 , p1 ) = p1 ln p1 p0 ln p0 p1 ln p1 .
The output-signal PMF has the form
1
fY (y) = q p1 A + p0 0 + p1 A + (1 q)
2b
p1 1(2A y 0) + p0 1(A y A) + p1 1(0 y 2A)
and its entropy h(Y ) (calculated relative to the reference measure on R, whose
absolutely continuous component coincides with the Lebesgue and discrete component assigns value 1 to points A, 0 and A) is given by
h(Y ) = q ln q (1 q) ln(1 q) qh(p1 , p0 , p1 )
p1 + p0
p1
+ p1 + p0 ln
(1 q)A p1 ln
2A
2A
p0 + p1
p1
+ p1 ln
+ p0 + p1 ln
.
2A
2A
395
p
1 p
(1q)A/q
.
If (1 q)A/q > 1 this equation yields a unique solution which defines an optimal
input-signal distribution PX of the form (4.1.44a)(4.1.44b).
If we wish to see what value of q yields the maximum of h(Y ) (and hence, the
maximum information capacity), we differentiate in q as well:
q
d
h(Y ) = 0 log
= (A 1)h(p, p, 1 2p) 2A ln 2A.
dq
1q
If we wish to consider a continuously distributed input signal on [A, A], with a
PDF fX (x), then the output random variable Y = X + Z has the PDF given by the
convolution:
0
1 (y+b)A
fX (x)dx.
fY (y) =
2b (yb)(A)
1
The differential entropy h(Y ) = fY (y) ln fY (y)dy, in terms of fX , takes the form
0 (x+z+b)A
0
0 b
1 A
1
h(X + Z) =
fX (x)
ln
fX (x )dx dzdx.
2b A
2b (x+zb)(A)
b
The PDF fX minimising the differential entropy h(X + Z) yields a solution to
0 b 0 (x+z+b)A
1
fX (x )dx + fX (x)
ln
0=
2b (x+zb)(A)
b
1
0 (x+z+b)A
fX (x )dx
fX (x + z + b) fX (x + z b) dz.
(x+zb)(A)
396
C = ln 10
2b
2b
Figure 4.3
with x A partition domain B (i.e. cover B but do not intersect each other) then,
for the input PMF Px with Px (x) = 1 ( A ) (a uniform distributionover A ), the
output-vector-signal PDF fY is uniform on B (that is, fY (y) = 1 area of B ).
Consequently, the output-signal entropy h(Y ) = ln area of B is attaining the
maximum over all input-signal PMFs Px with Px (A ) = 1 (and even attaining the
maximum over all input-signal PMFs Px with Px (B ) = 1 where B B is an
arbitrary subset with the property that x B S(x ) lies within B). Finally, the information capacity for the channel under consideration,
Cinf =
1 area of B
ln
nats/(scalar input signal).
2
4b2
1 area of D2
ln
nats/(scalar input signal),
2
4b2
of an additive channel with a uniform noise over (b, b), when the channel is used
two times per scalar input signal and the random vector input x = (X1 , X2 ) is subject
to the regional constraint x D2 . The maximising input-vector PMF assigns equal
probabilities to the centres of squares forming the partition.
A similar conclusion holds in R3 when the channel is used three times for every
input signal, i.e. the input signal is a three-dimensional vector x = (x1 , x2 , x3 ), and
so on. In general, when we use a K-dimensional input signal x = (x1 , . . . , xk ) RK ,
and the regional constraint is x DK RK where DK is a bounded domain that can
397
1 volume of DK
ln
nats/(scalar input signal)
K
(2b)K
(4.2.1)
(4.2.2)
Theorem 4.2.1 (ShannonMcMillanBreiman) For any stationary ergodic process X with finitely many values the information rate R = h, i.e. the limit in (4.2.3)
398
lim
(4.2.3)
The proof of Theorem 4.2.1 requires some auxiliary lemmas and is given at the
end of the section.
Worked Example 4.2.2 (A general asymptotic equipartition property) Given a
sequence of random variables
X1 , X2 , . . ., for all N = 1, 2, . . ., the distribution of
X1
..
N
the random vector x1 = . is determined by a PMF fxN (xN1 ) with respect to
1
XN
(N)
measure = (N factors). Suppose that the statement of the Shannon
McMillanBreiman theorem holds true:
1
log fxN (xN1 ) h in probability,
1
N
where h > 0 is a constant (typically, h = lim h(Xi )). Given > 0, consider the
i
typical set
x1
1
..
N
N
N
S = x1 = . : log fxN (x1 ) + h .
1
xN
0
SN
ties:
(4.2.4)
P(RN )
1=
RN
fxN (xN1 )
1
fxN (xN1 )
1
N(h+ )
1 jN
(dx j )
(dx j )
1 jN
SN
fxN (xN1 )
RN
1 jN
SN
1 jN
(4.2.5)
399
giving the upper bound (4.2.4). On the other hand, given > 0, we can take N
large so that P(SN ) 1 , in which case, for 0 < < h,
1 P(SN )
0
SN
fxN (xN1 )
1
N(h )
(dx j )
1 jN
SN 1 jN
X1
Y1
random vectors XN1 = ... and YN1 = ... which is determined by a (joint)
XN
YN
PMF fxN ,YN with respect to measure (N) (N) where (N) = and
1
(N) = (N factors in both products). Let fXN1 and fYN1 stand for the
(joint) PMFs of vectors XN1 and YN1 , respectively.
As in Worked Example 4.2.2, we suppose that the statements of the Shannon
McMillanBreiman theorem hold true, this time for the pair (XN1 , YN1 ) and each of
XN1 and YN1 : as N ,
1
1
log fXN (XN1 ) h1 , log fYN (YN1 ) h2 ,
1
1
N
N
in probability,
1
N
N
log fXN ,YN (X1 , Y1 ) h,
1
1
N
(4.2.6)
lim I(Xi : Yi ). Given > 0, consider the typical set formed by sample pairs (xN1 , yN1 )
where
x1
xN1 = ...
xN
400
and
y1
yN1 = ... .
yN
Formally,
%
1
= (xN1 , yN1 ) : log fxN (xN1 ) + h1 ,
1
N
1
log fYN (yN1 ) + h2 ,
1
N
K
1
N N
log fxN ,YN (x1 , y1 ) + h ; (4.2.7)
1
1
N
N
by the above assumption we have that lim P T = 1 for all > 0. Next, define
TN
(N)
(N)
N 0
T =
TN
Finally, consider an independent pair XN1 , YN1 where component XN1 has the same
PMF as XN1 and YN1 the same PMF as YN1 . That is, the joint PMF for XN1 and YN1
has the form
fXN ,YN (xN1 , yN1 ) = fXN (xN1 ) fYN (yN1 ).
(4.2.8)
Next, we assess the volume of set TN and then the probability that xN1 , YN1
1
TN .
Worked Example 4.2.3 (A general joint asymptotic equipartition property)
(I) The volume of the typical set has the following properties:
(N) (N) TN 2N(h+ ) , for all and N,
(4.2.9)
and, for all > 0 and 0 < < h, for N large enough, depending on ,
(N) (N) TN (1 )2N(h ) .
(II) For the independent pair XN1 , YN1 ,
P XN1 , YN1 TN 2N(h1 +h2 h3 ) , for all and N,
and, for all > 0, for N large enough, depending on ,
P XN1 , YN1 TN (1 )2N(h1 +h2 h+3 ) ,
for all .
(4.2.10)
(4.2.11)
(4.2.12)
401
Solution (I) Completely follows the proofs of (4.2.4) and (4.2.5) with integration
of fxN ,YN .
1
1
(II) For the probability P XN1 , YN1 TN we obtain (4.2.11) as follows:
P
0
XN1 , YN1 TN =
TN
by definition
0
TN
substituting (4.2.8)
2N(h1 ) 2N(h2 )
0
TN
(dxN1 ) (dyN1 )
according to (4.2.7)
2N(h1 ) 2N(h2 ) 2N(h+ ) = 2N(h1 +h2 h3 )
because of bound (4.2.9).
Finally, by reversing the inequalities in the last two lines, we can cast them as
2N(h1 + ) 2N(h2 + )
0
TN
(dxN1 ) (dyN1 )
according to (4.2.7)
(1 )2N(h1 + ) 2N(h2 + ) 2N(h ) = (1 )2N(h1 +h2 h+3 )
because of bound (4.2.10).
Formally, we assumed here that 0 < < h (since it was assumed in (4.2.10)), but
increasing only makes the factor 2N(h1 +h2 h+3 ) smaller. This proves bound
(4.2.12).
A more convenient (and formally a broader) extension of the asymptotic equipartition property is where we suppose that the statements of
the ShannonMcMillanBreiman
theorem
hold true directly for the ratio
N
N
N
N
fxN ,YN (X1 , Y1 ) fXN (X1 ) fYN (Y1 )) . That is,
1
(4.2.13)
402
where c > 0 is a constant. Recall that fXN ,YN represents the joint PMF while fXN
1
1
1
and fxN individual PMFs for the random input and output vectors xN and YN , with
1
;
(4.2.14)
TN = (XN1 , yN1 ) : log
N
fXN (xN1 ) fYN (yN1 )
1
1
N N
by assumption (4.2.13) we have that lim P X1 , Y1 TN = 1 for all > 0.
N
Again, we will consider an independent pair XN1 , YN1 where component XN1
has the same PMF as XN1 and YN1 the same PMF as YN1 .
Theorem 4.2.4 (Deviation from the joint asymptotic equipartition property)
Assume that property (4.2.13) holds true. For an independent pair XN1 , YN1 , the
probability that XN1 , YN1 TN obeys
P
XN1 , YN1
TN
(4.2.15)
(4.2.16)
TN
N(c )
N(c )
=2
TN
2N(c ) .
N N
X1 , Y1 TN
403
The first equality is by definition, the second step follows by substituting (4.2.8),
the third is by direct calculation, and the fourth because of the bound (4.2.14).
Finally, by reversing the inequalities in the last two lines, we obtain the bound
(4.2.16):
2N(c+ )
x(1)
X(i) with i C). Similarly, given a vector x = ... of values for x, denote by
x(n)
x(C) the argument {x(i) : i C} (the sub-column in x extracted by picking the rows
with i C). By the Gibbs inequality, for all partitions {C1 , . . . ,Cs } of set {1, . . . , n}
into non-empty disjoint subsets C1 , . . . ,Cs (with 1 s n), the integral
0
fx (x) log
fxn1 (x)
(dx( j)) 0.
fx(C1 ) (x(C1 )) . . . fx(Cs ) (x(Cs )) 1
jn
(4.2.17)
What is the partition for which the integral in (4.2.17) attains its maximum?
Solution The partition in question has s = n subsets, each consisting of a single
point. In fact, consider the partition of set {1, . . . , n} into single points; the corresponding integral equals
0
fx (x) log
fxn1 (x)
(dx( j)).
fXi (xi ) 1 jn
(4.2.18)
1in
Let {C1 , . . . ,Cs } be any partition of {1, . . . , n}. Multiply and divide the fraction
under the log by the product of joint PMFs fx(Cl ) (x(Cl )). Then the integral
1is
fx (x) log
fxn1 (x)
(dx( j)) + terms 0.
fx(Ci ) (x(Ci )) 1 jn
1is
404
Here x(C) = {X(i) : i C}, x(C) = {X(i) : i C}, and E I(x(C : Y )|x(C) stands
for the expectation of I(x(C : Y ) conditional on the value of x(C). Prove that this
sum does not depend on the choice of set C.
Solution Check that the expression in question equals I(x : Y ).
In Section 4.3 we need the following facts about parallel (or product) channels.
Worked Example 4.2.7 (Lemma A in [173]; see also [174]) Show that the
capacity of the product of r time-discrete Gaussian channels with parameters
( j , p( j) , 2j ) equals
j
p( j)
ln 1 +
.
(4.2.19)
C=
j 2j
1 jr 2
the case r = 2. For the direct part, assume that R < C1 +C2 and > 0 are given. For
sufficiently large we must find a code for the product channel with M = eR codewords and Pe < . Set = (C1 +C2 R)/2. Let X 1 and X 2 be codes for channels
1 and 2 respectively with M1 e(C1 ) and M2 e(C2 ) and error-probabilities
1
2
PeX , PeX /2. Construct a concatenation code X with codewords x = xk1 xl2
where xi X i , i = 1, 2. Then, for the product-channel under consideration, with
1
2
codes X 1 and X 2 , the error-probability PeX , X is decomposed as follows:
1
2
1
P error in channel 1 or 2| xk1 xl2 sent .
PeX , X =
M1 M2 1kM1 ,1lM2
By independence of the channels, PeX
direct part.
1, X 2
405
The proof of the inverse is more involved and we present only a sketch, referring
the interested reader to [174]. The idea is to apply the so-called list decoding:
suppose we have a code Y of size M and a decoding rule d = d Y . Next, given
that a vector y has been received at the output port of a channel, a list of L possible
Y , and the
code-vectors from Y has to be produced, by using a decoding rule d = dlist
decoding (based on rule d) is successful if the correct word is in the list. Then, for
the average error-probability Pe = PeY (d) over code Y , the following inequality is
satisfied:
Pe Pe ( d ) PeAV (L, d)
(4.2.20)
where the error-probability Pe (d) = PeY (d) refers to list decoding and PeAV (L, d) =
PeAV (Y , L, d) stands for the error-probability under decoding rule d averaged over
all subcodes in Y of size L.
Now, going back to the product-channel with marginal capacities C1 and C2 ,
choose R > C1 +C2 , set = (R C1 C2 )/2 and let the list size be L = eRL , with
RL = C2 + . Suppose we use a code Y of size eR with a decoding rule d and a
list decoder d with the list-size L. By using (4.2.20), write
Pe Pe (d)PeAV (eRL , d)
(4.2.21)
and use the facts that RL > C2 and the value PeAV (eRL , d) is bounded away from
zero. The assertion of the inverse part follows from the following observation discussed in Worked Example 4.2.8. Take R2 < R RL and consider subcodes L Y
of size L = eR2 . Suppose we choose subcode L at random, with equal probabilities. Let M2 = eR2 and PeY ,M2 (d) stand for the mean error-probability averaged
over all subcodes L Y of size L = eR2 . Then
Pe (d) PeY ,M2 (d) + ( )
(4.2.22)
where ( ) 0 as .
Worked Example 4.2.8 Let L = eRL and M = eR . We aim to show that if
R2 < R RL and M2 = eR2 then the following holds. Given a code X of size M ,
a decoding rule d and a list decoder d with list-size L, consider the mean errorprobability PeX ,M2 (d) averaged over the equidistributed subcodes S X of size
S = M2 . Then PeX ,M2 (d) and the list-error-probability PeX (d) satisfy
PeX (d) PeX ,M2 (d) + ( )
(4.2.23)
where ( ) 0 as .
Solution Let X , S and d be as above and suppose we use a list decoder d with
list-length L.
406
Given a subcode S X with M2 codewords, we will use the following decoding. Let L be the output of decoder d. If exactly one element x j S belongs
to L , the decoder for S will declare x j . Otherwise, it will pronounce an error.
Denote the decoder for S by d S . Thus, given that xk S was transmitted, the
resulting error-probability, under the above decoding rule, takes the form
Pek = p(L |xk )ES (L |xk )
L
C
DX ,M2
1 M2
p(L
|x
)E
(L
,
x
|x
)
j k
k
M2 k=1
L j=k
(4.2.25)
I JX ,M2
means the average over all selections of subcodes. As x j and xk
where
are chosen independently,
DX ,M2 C
DX ,M2 C
DX ,M2
C
= p(L |xk )
.
E2 (L , x j )
p(L |xk )E 2 (L , x j )
Next,
C
p(L |xk )
DX ,M2
xX
C
DX ,M2
1
L
p(L |x), E2 (L , x j |xk )
= ,
M
M
and we obtain
PeX ,M2 PeX (d) +
1
L
1 M2
p(L
|x)
M
M
M2 k=1
L xX
j=k
407
which implies
PeX ,M2 PeX (d) +
M2 L
.
M
(4.2.26)
(4.2.27)
1
1
)
= h(X0 |Xk
H (k) = E log p X0 |Xk
(4.2.28)
1
1
H = E log p X0 |X
).
= h(X0 |X
(4.2.29)
i=k
Set also
and
The proof is based on the following three results: Lemma 4.2.9 (the sandwich
lemma), Lemma 4.2.10 (a Markov approximation lemma) and Lemma 4.2.11 (a
no-gap lemma).
Lemma 4.2.9
Proof
(4.2.30)
p(X0n1 )
1
lim sup log
0 a.s.
1
p(X0n1 |X
)
n n
(4.2.31)
(k) n1
p(k) (X0n1 )
n1 p (x0 )
p(x
)
=
0
p(X0n1 )
p(x0n1 )
xn1 A
n
p(k) (x0n1 )
x0n1 An
= p(k) (A) 1.
408
1
Similarly, if Bn = Bn (X
) is a support event for pX n1 |X 1 (i.e. P(X0n1
0
1
) = 1), write
Bn |X
p(X0n1 )
p(x0n1 )
n1 1
p(x
|X
)
=
E
1
0
X
1
1
p(X0n1 |X
)
p(x0n1 |X
)
xn1 B
n
= EX 1
p(x0n1 )
x0n1 Bn
= EX 1 P(Bn ) 1.
(4.2.32)
a.s.
1
1
) H.
log p(X0n1 |X
n
(4.2.33)
1
1
) and f = log p(X0 |X
) into
Proof Substituting f = log p(X0 |Xk
Birkhoffs ergodic theorem (see for example Theorem 9.1 from [36]) yields
a.s.
1
1 n1
1
i1
) 0 + H (k)
log p(k) (X0n1 ) = log p(X0k1 ) log p(k) (Xi |Xik
n
n
n i=k
(4.2.34)
and
a.s.
1 n1
1
1
i1
) = log p(Xi |X
) H,
log p(X0n1 |X
n
n i=0
(4.2.35)
respectively.
So, by Lemmas 4.2.9 and 4.2.10,
1
1
1
1
lim log (k) n1 = H (k) ,
lim sup log
n1
p(X0 ) n n
p (X0 )
n n
(4.2.36)
409
and
1
1
1
1
lim inf log
lim log
= H, )
n1
n1 1
n n
n
n
p(X0 )
p(X0 |X )
which we rewrite as
1
1
H lim inf log p(X0n1 ) lim sup log p(X0n1 ) H (k) .
n
n
n
n
Lemma 4.2.11
(4.2.37)
(4.2.38)
As the set of values I is supposed to be finite, and the function p [0, 1] p log p
is bounded, the bounded convergence theorem gives that as k ,
H (k) = E
1
1
) log p(X0 = x0 |Xk
)
p(X0 = x0 |Xk
x0 I
x0 I
1
1
p(X0 = x0 |X
) log p(X0 = x0 |X
) = H.
410
The setting is as follows. Fix numbers , , p > 0 and assume that every seconds
a coder produces a real code-vector
x1
..
x= .
xn
where n = . All vectors x generated by the coder lie in a finite set X = Xn
Rn of cardinality M 2Rb = eRn (a codebook); sometimes we write, as before,
XM,n to stress the role of M and n. It is also convenient to list the code-vectors
from X as x(1), . . . , x(M) (in an arbitrary order) where
x1 (i)
x(i) = ... , 1 i M.
xn (i)
Code-vector x is then converted into a continuous-time signal
n
(4.3.1)
i=1
xi =
x(t)i (t)dt.
(4.3.2)
The instantaneous signal power at time t is associated with |x(t)|2 ; then the square1
norm ||x||2 = 0 |x(t)|2 dt = |xi |2 will represent the full energy of the signal in
1in
the interval [0, ]. The upper bound on the total energy spent on transmission takes
the form
||x||2 p , or x Bn ( p ).
(4.3.3)
(In the theory of waveguides, the dimension n is called the Nyquist number and the
value W = n/(2 ) /2 the bandwidth of the channel.)
The code-vector x(i) is sent through an additive channel, where the receiver gets
the (random) vector
Y1
..
(4.3.4)
Y = . where Yk = xk (i) + Zk , 1 k n.
Yn
411
From the start we declare that if x(i) X \ Bn ( p ), i.e. ||x(i)||2 > p , the
output signal vector Y is rendered non-decodable. In other words, the probability
of correctly decoding the output vector Y = x(i) + Z with ||x(i)||2 > p is taken to
be zero (regardless of the fact that the noise vector Z can be small and the output
vector Y close to x(i), with a positive probability).
Otherwise, i.e. when ||x(i)||2 p , the receiver applies, to the output vector Y,
a decoding rule d(= dn,X ), i.e. a map y K d(y) X where K Rn is a
decodable domain (where map d had been defined). In other words, if Y K
then vector Y is decoded as d(Y) X . Here, an error arises either if Y K or if
d(Y) = x(i) given that x(i) was sent. This leads to the following formula for the
probability of erroneously decoding the input code-vector x(i):
1,
Pe (i, d) =
Pch Y K or d(Y) = x(i)|x(i) sent ,
||x(i)||2 > p ,
||x(i)||2 p .
(4.3.5)
The average error-probability Pe = PeX ,av (d) for the code X is then defined by
Pe =
1
Pe (i, d).
M 1iM
(4.3.6)
Furthermore, we say that Rbit (or Rnat ) is a reliable transmission rate (for given
and p) if for all > 0 we can specify 0 ( ) > 0 such that for all > 0 ( )
there exists a codebook X of size X eRnat and a decoding rule d such that
Pe = PeX ,av (d) < . The channel capacity C is then defined as the supremum of all
reliable transmission rates, and the argument from Section 4.1 yields
C=
p
ln 1 +
(in nats);
2
2
cf. (4.1.17). Note that when , the RHS in (4.3.7) tends to p/(2 2 ).
(4.3.7)
412
.
..
...
..
0
.......
... ..
....... ... .. . . .. .. . .
T
. .. .. . .. . . .
..
..
. . . .. .... .
.
.
.
.
..
. .. . . . . .
... ... .
.. .
.. ..
.....
....
.
.
Figure 4.4
fi (t) =
413
(4.3.10)
(4.3.11)
1kn
Here
sinc (2Wt k) =
sin( s) , s = 0,
sinc(s) =
s R,
s
1,
s = 0,
(4.3.12)
F ( ) = (x)ei x dx, R.
(4.3.13)
1
( )ei x d .
F (x) =
2
(4.3.14)
(x) =
1
2
( )eix d .
(4.3.15)
1 (x)2 (x)dx =
1 ( )2 ( )d .
(4.3.16)
414
0.2
0.4
0.6
W=0.5
W=1
W=2
4
2
0
ts
2
4
Figure 4.5
415
Furthermore, the Fourier transform can be defined for generalised functions too;
see again [127]. In particular, the equations similar to (4.3.13)(4.3.14) for the
delta-function look like this:
0
0
1
i t
(t) =
e
d , 1 = (t)eit dt,
(4.3.17)
2
implying that the Fourier transform of the Dirac delta is ( ) 1. For the shifted
delta-function we obtain
0
1
k
=
t
eik /(2W ) ei t d .
2W
2
(4.3.18)
i t
d , 1[ , ] ( ) =
sinc(t)eit dt,
t, R1 (4.3.19)
(4.3.20)
416
Solution The shortest way to see this is to write the Fourier-decomposition (in
2 (R)) implied by (4.3.19):
1
2 W sinc (2Wt k) =
2 W
0 2 W
2 W
(4.3.21)
1 (| | 2 W )eik /(2W ) , k = 1, . . . , n,
2 W
are orthonormal. That is,
1
4 W
0 2 W
2 W
ei(kk ) /(2W ) d = kk
where
'
kk =
1,
k = k ,
0,
k = k ,
(4.3.22)
| fi (t)|2 dt,
(4.3.23)
and functions fi have been introduced in (4.3.10). Thus, the power constraint can
be written as
|| fi ||2 p /4 W = p0 .
(4.3.24)
In fact, the coefficients xk (i) coincide with the values fi (k/(2W )) of function fi
calculated at time points k/(2W ), k = 1, . . . , n; these points can be referred to as
sampling instances.
Thus, the input signal fi (t) develops in continuous time although it is completely
specified by its values fi (k/(2W )) = xk (i). Thus, if we think that different signals
are generated in disjoint time intervals (0, ), ( , 2 ), . . ., then, despite interference
caused by infinite tails of the function sinc(t), these signals are clearly identifiable
through their values at sampling instances.
The NyquistShannon assumption is that signal fi (t) is transformed in the channel into
g(t) = fi (t) + Z (t).
(4.3.25)
417
Here Z (t) is a stationary continuous-time Gaussian process with the zero mean
(EZ (t) 0) and the (auto-)correlation function
(4.3.26)
E Z (s)Z (t + s) = 202W sinc(2Wt), t, s R.
In particular, when t is a multiple of /W (i.e. point t coincides with a sampling
instance), the random variables Z (s) and Z (t + s) are independent. An equivalent
form of this condition is that the spectral density
( ) :=
eit E Z (0)Z (t) dt = 02 1 | | < 2 W .
(4.3.27)
(4.3.28)
1kn
418
1
Pe (i, d).
M 1iM
(4.3.30)
Value R(= Rnat ) is called a reliable transmission rate if, for all > 0, there exists
and a code X of size M eR such that Pe < .
Now fix a value (0, 1). The class A ( ,W, p0 ) = A ( ,W, p0 , ) is defined as
the set of functions f (t) such that
(i) f = D f where
D f (t) = f (t)1(|t| < /2), t R,
1
and f (t) has the Fourier transform eit f (t)dt vanishing for | | > 2 W ;
(ii) the ratio
|| f ||2
1 ;
|| f ||2
and
(iii) the norm || f ||2 p0 .
In other words, the transmittable signals f A ( ,W, p0 , ) are sharply localised in time and nearly band-limited in frequency.
419
As 0,
p0
202W
C( ) W ln 1 +
p0
.
1 02
p0
202W
(4.3.31)
(4.3.32)
Before going to (quite involved) technical detail, we will discuss some facts relevant to the product, or parallel combination, of r time-discrete Gaussian channels.
(In essence, this model was discussed at the end of Section 4.2.) Here, every time
units, the input signal is generated, which is an ordered collection of vectors
( j)
x1
+
* (1)
..
nj
(4.3.33)
x , . . . , x(r) where x( j) =
. R , 1 j r,
( j)
xn j
and n j = j with j being a given value (the speed of the digital production
from coder j). For each vector x( j) we consider a specific power constraint:
O O2
O ( j) O
Ox O p( j) , 1 j r.
The output signal is a collection of (random) vectors
( j)
Y
)
(
1.
( j)
( j)
( j)
.
Y(1) , . . . , Y(r) where Y( j) =
. and Yk = xk + Zk ,
( j)
Yn j
(4.3.34)
(4.3.35)
2
( j)
( j)
with Zk being IID random variables, Zk N 0, ( j) , 1 k n j , 1 j r.
420
A codebook X with information rate R, for the product-channel under consideration, is an array of M input signals,
(
x(1) (1), . . . , x(r) (1)
(1)
x (2), . . . , x(r) (2)
(4.3.36)
...
...
...
)
(1)
x (M), . . . , x(r) (M) ,
each of which has the same structure as in (4.3.33).*As before, a +
decoder d is a map
acting on a given set K of sample output signals y(1) , . . . , y(r) and taking these
signals to X .
As above, for i = 1, . . . , M,we define the error-probability
Pe (i, d) for code X
(1)
(r)
when sending an input signal x (i), . . . , x (i) :
2
Pe (i, d) = 1,
if x( j) (i) p( j) for some j = 1, . . . , r,
and
+
Y(1) , . . . , Y(r) K or
+
+ *
*
d Y(1) , . . . , Y(r) = x(1) (i), . . .
, x(r) (i)
*
+
| x(1) (i), . . . , x(r) (i) sent ,
2
if x( j) (i) < p( j) for all j = 1, . . . , r.
Pe (i, d) = Pch
The average error-probability Pe = PeX ,av (d) for code X (while using decoder d)
is then again given by
1
Pe =
Pe (i, d).
M 1iM
As usual, R is said to be a reliable transmission rate if for all > 0 there exists a
0 > 0 such that for all > 0 there exists a code X of cardinality M eR and a
decoding rule d such that Pe < . The capacity of the combined channel is again defined as the supremum of all reliable transmission rates. In Worked Example 4.2.7
the following fact has been established (cf. Lemma A in [173]; see also [174]).
Lemma 4.3.4
421
(4.3.38a)
(4.3.38b)
(4.3.38c)
x( j) 2 < p0
(4.3.39a)
1 j3
and
x(3) 2
x( j) 2 .
(4.3.39b)
1 j3
and
x(1) 2 < p0
(4.3.40a)
x(2) 2 < x(1) 2 + x(2) 2 .
(4.3.40b)
Worked Example 4.3.5 (cf. Theorem 1 in [173]). We want to prove that the
capacities of the above combined parallel channels of types IIII are as follows.
Case I, 1 2 :
(1 )p0
1
2
p0
+
C=
ln 1 +
ln 1 +
2
2
1 02
2 02
where
= min ,
2
.
1 + 2
(4.3.41a)
(4.3.41b)
422
(4.3.41c)
This means that the best transmission rate is attained when one puts as much energy into channel 2 as is allowed by (4.3.38b).
Case II:
Case III:
(1 )p0
1
ln 1 +
C=
2
(1 + 2 )12
2
p
(1 )p0
ln 1 +
+
+ 2.
2
2
( 1 + 2 ) 1
23
p0
1
p0
ln 1 +
C=
.
+
2
2
1 0
2(1 )02
(4.3.42)
(4.3.43)
Solution We present the proof for Case I only. For definiteness, assume that 1 <
2 . First, the direct part. With p1 = (1 )p0 , p2 = p0 , consider the parallel
combination of two channels, with individual power constraints on the input signals
x(1) and x(2) :
O O2
O O2
O (1) O
O O
(4.3.44a)
Ox O p1 , Ox(2) O p2 .
Of course, (4.3.44a) implies (4.3.38a). Next, with , condition (4.3.38b) also
holds true. Then, according to the direct part of Lemma 4.3.4, any rate R with
R < C1 (p1 ) +C2 (p2 ) is reliable. Here and below,
q
ln 1 +
(4.3.44b)
C (q) =
, = 1, 2.
2
02
This implies the direct part.
A longer argument is needed to prove the inverse. Set C = C1 (p1 ) + C2 (p2 ).
The aim is to show that any rate R > C is not reliable. Assume the opposite: there
exists such a reliable R = C + ; let us recall what it formally means. There exists
a sequence of values (l) and (a) a sequence of codes
(
(
)
)
X (l) = x(i) = x(1) (i), x(2) (i) , 1 i M (l)
423
+
of size M (l) e
composed of combined code-vectors x(i) = x(1) (i), x(2) (i)
with square-norms x(i)2 = x(1) (i)2 + x(2) (i)2 , and (b) a sequence of decoding maps d (l) : y K(l) d (l) (y) X (l) such that Pe 0. Here, as before,
R (l)
Pe = PeX
(l) ,av
1
Pe (i, d (l))
M (l) 1iM(l)
2
(l)
(2)
2
2
x1 (i)
..
R1 (l)
x(1) (i) =
.
(1)
x (l) (i)
1
(1)
x1 (i)
..
R2 (l)
and x(2) (i) =
.
(2)
x (l) (i)
2
are sent through their respective parts of the parallel-channel combination, which
results in output vectors
(1)
(2)
Y1 (i)
Y1 (i)
..
..
R1 (l) , Y(2) =
R2 (l)
Y(1) =
.
.
(1)
(2)
Y (l) (i)
Y (l) (i)
1
*
+
forming the combined output signal Y = Y(1) , Y(2) . The entries of vectors Y(1)
and Y(2) are sums
(1)
Yj
(1)
where Z j
(2)
and Zk
(1)
(1)
(2)
= x j (i) + Z j , Yk
(2)
(2)
= xk (i) + Zk ,
(2)
424
p0
p0
(2)
<
(xk )2 j
.
J0
J0
1k (l)
(4.3.45a)
1
ln J0 .
(l)
(4.3.45b)
(l)
On the other hand, the maximum error-probability for subcode X is not larger
than for the whole code X (l) (when using the same decoder d (l) ); consequently,
(l)
(l)
the error-probability PeX ,av d (l) Pe 0.
Having a fixed number J0 of classes in the partition of X (l) , we can find at
least one j0 {1, . . . , J0 } such that, for infinitely many l, the most numerous class
(l)
(l)
X coincides with X j . Reducing our argument to those l, we may assume that
(l)
(l)
(l)
X = X j0 for all l. Then, for all x(1) , x(2) X , with
(i)
x1
..
x(i) =
. , i = 1, 2,
(i)
xni
using (4.3.38a) and (4.3.45a)
O O2
O O2
( j0 1)
j0
O (1) O
(l) O (2) O
,
p0 (l) .
p
x
Ox O
O O
0
J0
J0
(
)
(l)
That is,
X , d (l)
is a coder/decoder sequence for the standard parallelchannel combination (cf. (4.3.34)), with
( j0 1)
j0
p0 .
p1 = 1
p0 and p2 =
J0
J0
(l)
As the error-probability PeX ,av d (l) 0, rate R is reliable for this combination
of channels. Hence, this rate does not surpass the capacity:
( j0 1)
j0
1
p0 +C2
p0 .
R C1
J0
J0
425
Here and below we refer to the definition of Ci (u) given in (4.3.44b), i.e.
R C1 ((1 )p0 ) +C2 ( p0 ) +
(4.3.46)
where = j0 /J0 .
Now note that, for 2 1 , the function
(4.3.47)
0 1 (t), 2 (t), . . ., of a variable t R, belonging to the Hilbert space 2 (R) (i.e. with
n (t)2 dt < ), called prolate spheroidal wave functions (PSWFs), such that
n ( ) =
(a) The Fourier transforms
over, the functions n (t) form an orthonormal basis in the Hilbert subspace
formed by functions from 2 (R) with this property.
(b) The functions n (t) := n (t)1(|t| < /2) (the restrictions of n (t) to
( /2, /2)) are pairwise orthogonal:
0
n (t)n (t)dt
0 /2
/2
(4.3.48a)
0 /2
/2
0 /2
/2
n (s) sinc 2W (t s) ds.
(4.3.48b)
426
That is, functions n (t)0 are the eigenfunctions, with the eigenvalues n , of the
integral operator
K(t, s) = 1(|s| < /2)(2W ) sinc 2W (t s)
sin(2 W (t s))
, /2 s;t /2.
= 1(|s| < /2)
(t s)
(d) The eigenvalues n satisfy the condition
n =
0 /2
/2
An equivalent formulation
can be given in terms involving the Fourier trans0
forms [Fn ] ( ) =
1
2
n (t)eit dt:
0 2 W
2 W
| [Fn ] ( ) |2 d
0
/2
/2
|n (t)|2 dt = n ,
which means that n gives a frequency concentration for the truncated function n .
(e) It can be checked that functions n (t) (and hence numbers n ) depend on W
and through the product W only. Moreover, for all (0, 1), as W ,
(4.3.48c)
That is, for large, nearly 2W of values n are close to 1 and the rest are close
to 0.
An n (t),
(4.3.49)
n1
where 1 (t), 2 (t), . . . are the PSWFs discussed in Worked Example 4.3.9 below
and A1 , A2 , . . . are IID random variables with An N(0, n ) where
n are the corresponding eigenvalues. Equivalently, one writes Z(t) = n1 n n n (t) where
n N(0, 1) IID random variables.
The proof of this fact goes beyond the scope of this book, and the interested
reader is referred to [38] or [103], p. 144.
427
The idea of the proof of Theorem 4.3.3 is as follows. Given W and , an input
signal s (t) from A ( ,W, p0 , ) is written as a Fourier series in the PSWFs n .
In this series, the first 2W summands represent the part of the signal confined
between the frequency band-limits 2 W and the time-limits /2. Similarly, the
noise realisation Z(t) is decomposed in a series in functions n . The action of the
continuous-time channel is then represented in terms of a parallel combination of
two jointly constrained discrete-time Gaussian channels. Channel 1 deals with the
first 2W PSWFs in the signal decomposition and has 1 = 2W . Channel 2 receives
the rest of the expansion and has 2 = +. The power constraint s2 p0 leads
to a joint constraint, as in (4.3.38a). In addition, a requirement emerges that the
energy allocated outside the frequency band-limits 2 W or time-limits /2 is
small: this results in another power constraint, as in (4.3.38b). Applying Worked
Example 4.3.5 for Case I results in the assertion of Theorem 4.3.3.
To make these ideas precise, we first derive Theorem 4.3.7 which gives an alternative approach to the NyquistShannon formula (more complex in formulation
but somewhat simpler in the (still quite lengthy) proof).
Theorem 4.3.7 Consider the following modification of the model from Theorem
4.3.3. The set of allowable signals A2 ( ,W, p0 , ) consists of functions t R
s(t) such that
(1) s2 =
|s(t)|2 dt p0 ,
(2) the Fourier transform [Fs]( ) = s(t)eit dt vanishes when | | > 2 W , and
0 /2
(3) the ratio
|s(t)|2 dt s2 > 1 . That is, the functions s
/2
The noise process is Gaussian, with the spectral density vanishing when | | > 2W
and equal to 02 for | | 2W .
Then the capacity of such a channel is given by
p0
p0
+ 2.
C = C = W ln 1 + (1 ) 2
(4.3.50)
20 W
20
As 0,
C W ln 1 +
p0
202W
428
(4.3.51)
and take (0, 1) and (0, min [ , 1 ]) such that R is still less than
( )p0
(1 + )p0
+
.
(4.3.52)
C = W (1 ) ln 1 +
2
20 W (1 )
202
According to Worked Example 4.3.5, C is the capacity of a jointly constrained
discrete-time pair of parallel channels as in Case I, with
1 = 2W (1 ), 2 = +, = , p = p0 , 2 = 02 ;
(4.3.53)
cf. (4.3.41a). We want to construct codes and decoding rules for the timecontinuous version of the channel,
asymptotically vanishing probability
(1) (2)yielding
is an allowable input signal for the parallel
of error as . Assume x , x
pair of discrete-time channels with parameters given in (4.3.53). The input for the
time-continuous channel is the following series of (W, ) PSWFs:
s(t) =
1k1
(1)
xk k (t) +
(2)
xk k+1 (t).
(4.3.54)
1k<
The first fact to verify is that the signal in (4.3.54) belongs to A2 ( ,W, p0 , ), i.e.
satisfies conditions (1)(3) of Theorem 4.3.7.
To check property (1), write
2
2 O O2 O O2
O O O O
(1)
(2)
+
x
s2 =
xk = Ox(1) O + Ox(2) O p0 .
k
1k1
1k<
Next, the signal s(t) is band-limited, inheriting this property from the PSWFs
k (t). Thus, (2) holds true.
A more involved argument is needed to establish property (3). Because the
PSWFs k (t) are orthogonal in 2 [ /2, /2] (cf. (4.3.48a)), and using the monotonicity of the values n (cf. (4.3.48b)), we have that
0 /2
(1 D )s||2
2
|s(t)| dt s2 =
1
||s||2
/2
2
2
(1)
(2)
(1 k ) xk
(1 k+1 ) xk
=
OO (1) OO2 OO (2) OO2 + OO (1) OO2 OO (2) OO2
+ x
+ x
x
1k<
1k1 x
O (1) O2
O (2) O2
Ox O
Ox O
1 1 O O2 O O2 + O O2 O O2 .
Ox(1) O + Ox(2) O
Ox(1) O + Ox(2) O
429
(1)
1k1
Zk k (t) +
(2)
Zk k+1 (t).
(4.3.55)
1k<
( j)
Here again, k (t) are the PSWFs and IID random variables Zk N(0, k ). Correspondingly, the output signal is written as
Y (t) =
(1)
1k1
Yk k (t) +
(2)
Yk k+1 (t)
(4.3.56)
1k<
where
( j)
Yk
( j)
( j)
= xk + Zk , j = 1, 2, k 1.
(4.3.57)
So, the continuous-time channel is equivalent to a jointly constrained parallel combination. As we checked, the capacity equals C specified in (4.3.52). Thus, for
R < C we can construct codes of rate R and decoding rules such that the errorprobability tends to 0.
For the converse, assume that there exists a sequence (l) , a sequence of
(l)
transmissible domains A2 ( (l) ,W, p0 , (l) ) described in (1)(3) and a sequence
(l)
of codes X (l) of size M = eR where
p0
(1 )p0
+ 2 .
R > W ln 1 +
2
2W 0
0
As usual, we want to show that the error-probability PeX ,av (d (l) ) does not tend to
0.
As before, we take > 0 and (0, 1 ) to ensure that R > C where
p0
(1 )
p0
+
.
C = W (1 + ) ln 1 +
2
(1 ) 2W 0 (1 + )
(1 )02
(l)
430
Then, as in the argument on the direct half, C is the capacity of the type I jointly
constrained parallel combination of channels with
=
, 2 = 02 , p = p0 , 1 = 2W (1 + ), 2 = +.
(4.3.58)
1
(l)
1k1 (l)
(1)
xk k (t) +
1k<
(2)
(4.3.59)
We want to show that the discrete-time signal x = x(1) , x(2) represents an allowable input to the type I jointly constrained parallel combination specified in
(4.3.38ac). By orthogonality of PSWFs k (t) in 2 (R) we can write
x2 = ||s||2 p0 (l)
ensuring that condition (4.3.38a) is satisfied. Further, using orthogonality of PSW
functions k (t) in 2 ( /2, /2) and the fact that the eigenvalues k decrease
monotonically, we obtain that
0 (l) /2
(1 D (l) ) s2
|s(t)|2 dt s2 =
1
||s||2
(l) /2
2
2
(1)
(2)
(1 k ) xk
1 k+1 (l) xk
=
+
x2
x2
1k1 (l)
1k<
O
O
2
Ox(2) O
1 1 (l)
.
x2
By virtue of (4.3.48c), 1 (l) for l large enough. Moreover, since 1
0 (l)
/2
(l) /2
|s(t)|2 dt
x
431
1 = 2W (1 ), 2 = +, p = p0 , 2 = 02 , = ,
(4.3.60)
where (0, 1) (cf. property (e) of PSWFs in Example 4.3.6) and (0, ) are
auxiliary values.
For the converse half we use the decomposition into two parallel channels, again
as in Case III, with
1 = 2W (1 + ), 2 = +, p = p0 , 2 = 02 , =
.
1
(4.3.61)
Here, as before, value (0, 1) emerges from property (e) of PSWFs, whereas
value (0, 1).
eit f (t)dt
vanishes for | | > 2W . Then, for all x R, function f can be uniquely reconstructed from its values f (x + n/(2W )) calculated at points x + n/(2W ), where
n = 0, 1, 2. More precisely, for all t R,
n
sin [2 (Wt n)]
.
(4.3.62)
f (t) = f
2W
2 (Wt n)
nZ1
By the famous uncertainty principle of quantum
Worked Example 4.3.9
physics, a function and its Fourier transform cannot be localised simultaneously
in finite intervals [ , ] and [2W, 2W ]. What could be said about the case
when both function and its Fourier transform are nearly localised? How can we
quantify the uncertainty in this case?
Solution Assume the function f 2 (R) and let f = F f L2 (R) be the Fourier
transform of f . (Recall that space 2 (R) consists of functions f on R with || f ||2 =
432
1
| f (t)|2 dt < + and that for all f , g 2 (R), the inner product
finite.) We shall see that if
0
0 t0 + /2
2
| f (t)| dt
| f (t)|2 dt = 2
t0 /2
and
0 2 W
2 W
f (t)g(t)dt is
|F f ( )| d
0
|F f ( )|2 d = 2
(4.3.63)
(4.3.64)
1
2
0 2 W
2 W
F f ( )ei t d =
f (s)
sin 2 W (t s)
ds.
t s
(4.3.65)
(4.3.66)
0 /2
/2
f (s)
sin 2 W (t s)
ds;
t s
(4.3.67)
see Example 4.3.6. The eigenvalues n of A obey 1 > 0 > 1 > and tend to zero
as n ; see [91]. We are interested in the eigenvalue 0 : it can be shown that 0
is a function of the product W . In fact, the eigenfunctions ( j ) of (4.3.67) yield an
orthonormal basis in 2 (R); at the same time these functions form an orthogonal
basis in 2 [ /2, /2]:
0 /2
/2
j (t)i (t)dt = i i j .
433
||D f ||
.
|| f ||
(4.3.70)
1
Indeed,
expand
f
=
f
D
f
+
D
f
and
observe
that
the
integral
f (t)
D f (t) g(t)dt = 0 (since the supports of g and f D f are disjoint). This implies
that
0
0
0
Re f (t)g(t)dt f (t)g(t)dt = D f (t)g(t)dt .
Hence,
1
Re
|| f ||||g||
f (t)g(t)dt
||D f ||
|| f ||
the formula
cos1
||D f ||
= cos1
|| f ||
n |an |2 n
n |an |2
1/2
.
(4.3.71)
= 0 and0 < 1;
0 < < 0 < 1 and 0 1;
1 + cos1 cos1 ;
0 < 1 and cos
0
= 1 and 0 < 0 .
Proof Given [0, 1], let G ( ) be the family of functions f L2 with norms
|| f || = 1 and ||D f || = . Next, determine ( ) := sup f G ( ) ||B f ||.
(a) If = 0, the family G (0) can contain no function with = B f = 1. Furthermore, if D f = 0 and B f = 1 for f B then f is analytic and f (t) = 0 for
|t| < /2, implying f 0. To show that G (0) contains functions with all values of
n Dn
[0, 1), we set fn =
. Then the norm ||B fn || = 1 n . Since there
1 n
434
1 n since
||Be
ipt
f || =
0
p+ W
p W
1/2
|Fn ( | d
2
,
f=
0 n
for n large when the eigenvalue n is close to 0. We have that f B, || f || = ||B f || =
1, while a simple computation shows that ||D f || = . This includes the case = 1
as, by choosing eipt f (t) appropriately, we can obtain any 0 < < 1.
(4.3.72)
with g orthogonal to both D f and B f . Taking the inner product of the sum in the
RHS of (4.3.72), subsequently, with f , D f , B f and g we obtain four equations:
1 = a1 2 + a2 2 +
2 = a1 2 + a2
0
2 = a1
g(t) f (t)dt,
B f (t)Dg(t)dt,
D f (t)B f (t)dt + a2 2 ,
f (t)g(t)dt = g2 .
2 + 2 1 + ||g||2 = a1
D f (t)B f (t)dt + a2
B f (t)D f (t)dt.
435
2 =
1 2 ||g||2
( 2
B f (t)D f (t)dt)
+ 1
which is equivalent to
2Re
2
2
0
1 2 ||g||2
2 ( 2
B f (t)D f (t)dt)
B f (t)D f (t)dt
D f (t)B f (t)dt
D f (t)B f (t)dt
2
0
1
2
+ (1 2 2 D f (t)B f (t)dt
2
0
1
2
||g|| 1 2 2 D f (t)B f (t)dt .
(4.3.73)
(4.3.74)
cos = Re
0
D f (t)B f (t)dt D f (t)B f (t)dt .
(4.3.75)
(4.3.76)
The locus of points ( , ) satisfying (4.3.76) is up and to the right of the curve
where
$
(4.3.77)
cos1 + cos1 = cos1 0 .
See Figure 4.6.
Equation (4.3.77) holds for the function f = b1 0 + b2 D0 with
R
R
1 2
1 2
and b2 =
.
b1 =
1 0
1 0
0
All intermediate values of are again attained by employing eipt f .
0.2
0.4
beta^2
0.6
0.8
1.0
436
0.0
W=0.5
W=1
W=2
0.0
0.2
0.4
0.6
0.8
1.0
alpha^2
Figure 4.6
437
We will state several facts, without proof, about the existence and properties of
the Poisson random measure introduced in Definition 4.4.1.
Theorem 4.4.2 For any non-atomic and -finite measure on R+ there exists
a unique PRM satisfying Definition 4.4.1. If measure has the form (dt) =
dt where > 0 is a constant (called the intensity of ), this PRM is a Poisson
process PP( ). If the measure has the form (dt) = (t)dt where (t) is a given
function, this PRM gives an inhomogeneous Poisson process PP( (t)).
Theorem 4.4.3 (The mapping theorem) Let be a non-atomic and -finite
measure on R such that for all t 0 and h > 0, the measure (t,t + h) of the
interval (t,t + h) is positive and finite (i.e. the value (t,t + h) (0, )), with
lim (0, h) = 0 and (R+ ) = lim (0, u) = +. Consider the function
h0
u+
f : u R+ (0, u),
and let f 1 be the inverse function of f . (It exists because f (u) = (0, u) is strictly
monotone in u.) Let M be the PRM( ). Define a random measure f M by
( f M)(I) = M( ( f 1 I)) = M( ( f 1 (a), f 1 (b))),
(4.4.1)
n+
1
.
2
(4.4.2)
438
Solution Since
0 1
1
(x)dx = ,
there are with probability 1 infinitely many points of in (1, 1). On the other
hand,
0 1
1+
(x)dx <
for every > 0, so that (1+ , 1 ) is finite with probability 1. This is enough
to label uniquely in ascending order the points of . Let
0 x
f (x) =
0
(y)dy.
(a, b) =
0 f 1 (b)
f 1 (a)
(x)dx = b a.
With this choice of f , the points ( f (Xn )) form a Poisson process of unit rate on R.
The strong law of large numbers shows that, with probability 1, as n ,
n1 f (Xn ) 1, and n1 f (Xn ) 1.
Now, observe that
1
1
(x) (1 x)3 and f (x) (1 x)2 , as x 1.
4
8
Thus, as n , with probability 1,
1
n1 (1 Xn )2 1,
8
which is equivalent to (4.4.2). Similarly,
1
1
(x) (1 + x)2 and f (x) (1 + x)1 , as x 1,
8
8
implying that with probability 1, as n ,
1
n1 (1 + Xn )1 1.
8
Hence, with probability 1
1
lim n(1 + Xn ) = .
n
8
439
Worked Example 4.4.5 Show that, if Y1 < Y2 < Y3 < are points of a Poisson
process on (0, ) with constant rate function , then
lim Yn /n =
with probability 1. Let the rate function of a Poisson process = PP( (x)) on
(0, 1) be
(x) = x2 (1 x)1 .
Show that the points of can be labelled as
< X2 < X1 <
1
< X0 < X1 <
2
and that
lim Xn = 0 , lim Xn = 1 .
Prove that
lim nXn = 1
f (x) =
1/2
( )d ,
and use the fact that f maps into a PP of constant rate on ( f (0), f (1)): f () =
PP(1). In our case, f (0) = and f (1) = , and so f () is a PP on R. Its points
may be labelled
< Y2 < Y1 < 0 < Y0 < Y1 <
with
lim Yn = , lim Yn = +.
n+
Now, as x 0,
f (x) =
0 1/2
x
f (Xn )
Yn
= lim
= 1, a.s.
n n
n
(1 ) d
0 1/2
x
2 d x1 ,
440
implying that
1
Xn
= 1, i.e. lim nXn = 1, a.s.
n n
n
lim
Similarly,
lim
n+
and as x 1,
f (x)
0 x
1/2
f (Xn )
= 1, a.s.,
n
(1 )1 d ln(1 x).
ln(1 Xn )
= 1, a.s.
n
(n An ) = (An ).
n
The value (E) can be finite or infinite. Our aim is to define a random counting
measure M = (M(A), A E ), with the following properties:
(a) The random variable M(A) takes non-negative integer values (including, possibly, +). Furthermore,
'
Po( (A)), if (A) < ,
(4.4.3)
M(A)
= + with probability 1, if (A) = .
(b) If A1 , A2 , . . . E are disjoint sets then
M (i Ai ) = M(Ai ).
(4.4.4)
1in
P (M(Ai ) = ki ) .
(4.4.5)
441
First assume that (E) < (if not, split E into subsets of finite measure). Fix a
random variable M(E) Po( (E)). Consider a sequence X1 , X2 , . . . of IID random points in E, with Xi (E), independently of M(E). It means that for all
n 1 and sets A1 , . . . , An E (not necessarily disjoint)
n n
(Ai )
(E) (E)
,
P M(E) = n, X1 A1 , . . . , Xn An = e
n!
i=1 (E)
(4.4.6)
and conditionally,
n
(Ai )
P X1 A1 , . . . , Xn An |M(E) = n =
.
i=1 (E)
(4.4.7)
Then set
M(E)
M(A) =
1(Xi A), A E .
(4.4.8)
i=1
and
P(R1 > r) = P(C(r) contains at most one point of M)
2
= (1 + r2 )e r , r > 0.
Similarly,
2
1
P(R2 > r) = 1 + r2 + ( r2 )2 e r , r > 0.
2
442
Then
ER0 =
ER1 =
0
0
0
0
1
P(R0 > r)dr =
2
e r d
2
2 r =
1
,
2
0
2
1
e r r2 dr
= +
0
2
0
2
1
1
+
2 r2 e r d 2 r
=
2 2 2 0
3
= ,
4
2
0
r2 r2
3
e
ER2 = +
dr
2
0
4
0
2
2
1
3
= +
2 r2 e r d 2 r
4 8 2 0
3
3
= +
4 16
15
= .
16
We shall use for the PRM M on the phase space E with intensity measure constructed in Theorem 4.4.6 the notation PRM(E, ). Next, we extend the definition
of the PRM to integral sums: for all functions g : E R+ define
0
M(E)
M(g) =
g(Xi ) :=
g(y)dM(y);
(4.4.9)
i=1
summation is taken over all points Xi E, and M(E) is the total number of such
points. Next, for a general g : E R we set
M(g) = M(g+ ) M(g ),
with the standard agreement that + a = + and a = for all a (0, ).
[When both M(g+ ) and M(g ) equal , the value M(g) is declared not defined.]
Then
Theorem 4.4.8 (Campbell theorem) For all R and for all functions g : E R
such that e g(y) 1 is -integrable
0
Ee M(g) = exp
(4.4.10)
e g(y) 1 d (y) .
E
Proof
443
Write
Ee M(g) = e (E)
( (E))k
k!
= e (E) exp
(E)
k
k
e g(x) d (x)
e g(x) d (x)
= exp
e g(x) 1 d (x) .
Corollary 4.4.9
g(y)d (y);
Example 4.4.10 Suppose that the wireless transmitters are located at the points
of Poisson process on R2 of rate . Let ri be the distance from transmitter i to
the central receiver at 0, and the minimal distance to a transmitter is r0 . Suppose
that the power of the received signal is Y = Xi rP for some > 2. Then
i
0
Ee Y = exp 2
(4.4.11)
e g(r) 1 rdr ,
r0
where g(r) =
P
r
444
A popular model in application is the so-called marked point process with the
space of marks D. This is simply a random measure on Rd D or on its subset. We
will need the following product property proved below in the simplest set-up.
Theorem 4.4.11 (The product theorem) Suppose that a Poisson process with the
constant rate is given on R, and marks Yi are IID with distribution . Define a
random measure M on R+ D by
M(A) =
(Tn ,Yn ) A , A R+ D.
(4.4.12)
n=1
M(A) =
n=1
We know that Nt Po( t). Further, given that Nt = k, the jump points T1 , . . . , Tk
have the conditional joint PDF fT1 ,...,Tk ( |Nt = k) given by (4.4.7). Then, by using
further conditioning, by T1 , . . . , Tk , in view of the independence of the Yn , we have
E e M(A) |Nt = k
= E E e M(A) |Nt = k; T1 , . . . , Tk
0t
0t
...
=
0
E exp
1
tk
I (xi ,Yi ) A |Nt = k; T1 = x1 , . . . , Tk = xk
k
i=1
0t 0
0 D
e IA (x,y) d (y)dx .
Then
Ee M(A) = e t
= exp
0t 0
( t)k
k=0
0t
1
k! t k
445
e IA (x,y) d (y)dx
0 D
e IA (x,y) 1 d (y)dx .
0 D
The expression e IA (x,y) 1 takes value e 1 for (x, y) A and 0 for (x, y) A.
Hence,
0
(4.4.13)
Ee M(A) = exp e 1 d (y)dx , R.
A
n 0
n
Ee M(A) = exp e 1 d (y)dx = Ee M(Ai ) , R,
i=1
Ai
i=1
446
0
e a(y) 1 (dy) ,
Ee = exp
E
where
=
a(X).
a(y)M(dy) =
0 0
(dx dm) e Gm/|x| 1 .
Ee F = exp
R3 0
dEe F
|
and equals
d =0
(x)dx
(x, dm)
Gm
= GM
|x|
dx
R3
1
1(|x| R).
|x|2
1
dx 2 1(|x| R) =
|x|
0R
dr
1 2
r
r2
d cos
d = 4 R
which yields
EF = 4 GMR.
Finally, let D be the distance to the nearest star contributing to F at least C. Then,
by the product theorem,
P(D d) = P(no points in A) = exp (A) .
Here
K
%
Gm
C ,
A = (x, m) R3 R+ : |x| d,
|x|
and (A) =
447
0d
1
dr r2 d cos
r
d M
dmem/M
Cr/G
0d
= 4 drreCr/(GM)
0
= 4
GM
C
2
Cd Cd/(GM)
Cd/(GM)
e
1e
.
GM
P(|xi y|)
(4.4.14)
xi
where P is the emitted signal power and the function describes the fading of the
signal. In the case of so-called Rayleigh fading (|x|) = e |x| , and in the case of
the power fading (|x|) = |x| , > 2. By the Campbell theorem
0
( ) = E e Y = exp 2
r e P(r) 1 dr .
(4.4.15)
0
448
Assuming that the signal Sk from the point xk is amplified by the coefficient
we write the signal
Yj =
h jk Sk + Z j , j = 1, . . . , J.
(4.4.16)
xk
e2 ir jk /
P /2 ,
r jk
(4.4.17)
where is the transmission wavelength and r jk = |y j xk |. The noise random variables Z j are assumed to be IID N(0, 02 ). A similar formula could be written for
Rayleigh fading. We know that in the case of J = 1 and a single transmitter K = 1,
by the NyquistShannon theorem of Section 4.3, the capacity of the continuous
time, additive white Gaussian noise channel Y (t) = X(t)(x, y) + Z(t) with attenu1 /2
ation factor (x, y), subject to the power constraint /2 X 2 (t)dt < P , bandwidth
W , and noise power spectral density 02 , is
P2 (x, y)
C = W log 1 +
.
(4.4.18)
2W 02
Next, consider the case of finite numbers K of transmitters and J of receivers
K
(4.4.19)
i=1
the box Bn of area n (i.e. size n); partition them into two sets S1 and S2 , so
449
Cn =
Rk j
k=1 j=1
Pk s2k
log
1
+
,
Pk 0, Pk nP k=1
202
n
max
where sk is the kth largest singular value of the matrix L = ((xk , y j )), 02 is the
noise power spectral density, and the bandwidth W = 1.
This result allows us to find the asymptotic of capacity as n . In the most
2
n) ); in the case of power
interesting case of Rayleigh fading R(n) = Cn /n O( (log
n
1/ (log n)2
P2 (xi , x j )
02 + k=i, j P2 (xk , x j )
(4.4.22)
where P, 02 , k > 0 and 0 < 1k . We say that a transmitter located at xi can send a
message to receiver located at x j if SNR(xi x j ) k. For any k > 0 and 0 < < 1,
let An (k, ) be an event that there exists a set Sn of at least n points of such that
for any two points s, d Sn , SNR(s, d) > k. It can be proved (see [48]) that for all
(0, 1) there exists k = k( ) such that
lim P An (k( ), ) = 1.
(4.4.23)
450
First, we note that any given transmitter may be directly connected to at most
1 + ( k)1 receivers. Indeed, suppose that nx nodes are connected to the node x.
Denote by x1 the node connected to x and such that
(|x1 x|) (|xi x|), i = 2, . . . , nx .
(4.4.24)
02 + P(|xi x|)
i=2
which implies
P(|x1 x|) k02 + k P(|xi x|)
i2
2
k0 + k (nx 1)P(|x1 x|) + k
P(|xi x|)
inx +1
(4.4.25)
( ) = O( 1 ).
(4.4.26)
Gaussian noise in a channel. Enlist in the codebook XM,Nthe random points X(i)
from process (N) lying inside the Euclidean ball B(N) ( N ) and surviving the
following purge. Fix r > 0 (the minimal distance of the random code) and for any
point X j of a Poisson process (N) generate an IID random variable T j U([0, 1])
(a random mark). Next, for every point X j of the original Poisson process examine
451
the ball B(N) (X j , r) of radius r centred at X j . The point X j will survive only if its
mark T j is strictly smaller than the marks of all other points from (N) lying in
B(N) (X j , r). The resulting point process (N) is known as the Matern process; it is
an example of a more general construction discussed in the recent paper [1].
The main parameter of a random codebook with codewords x(N) of length N is
the induced distribution of the distance between codewords. In the case of codebooks generated by stationary point processes it is convenient to introduce a function K(t) such that 2 K(t) gives the expected number of ordered pairs of distinct
points in a unit volume less than distance t apart. In other words, K(t) is the expected number of further points within t of an arbitrary point of a process. Say,
for Poisson process on R2 of rate , K(t) = t 2 . In random codebooks we are interested in models where K(t) is much smaller for small and moderate t. Hence,
random codewords appear on a small distance from one another much more rarely
than in a Poisson process. It is convenient to introduce the so-called product density
(t) =
2 dK(t)
,
c(t) dt
(4.4.27)
where c(t) depends on the state space of the point process. Say, c(t) = 2 t on R1 ,
c(t) = 2 t 2 on R2 , c(t) = 2 sint on the unit sphere, etc.
Some convenient models of this type have been introduced by B. Matern. Here
we discuss two rather intuitive models of point processes on RN . The first is obtained by sampling a Poisson process of rate and deleting any point which is
within 2R of any other whether or not this point has already been deleted. The rate
of this process for N = 2 is
M,1 = e4 R .
2
(4.4.28)
(4.4.29)
Here B((0, 0), 2R) is the ball with centre (0, 0) of radius 2R, and B((t, 0), 2R) is
the ball with centre (t, 0) of radius 2R. For varying this model has the maximum
rate of (4 eR2 )1 and
so cannot model densely packed codes. This is 10% of the
theoretical bound ( 12R2 )1 which is attained by the triangular lattice packing,
cf. [1].
The second Matern model is an example of the so-called marked point process.
The points of a Poisson process of rate are independently marked by IID random
variables with distribution U([0, 1]). A point is deleted if there is another point of
452
the process within distance 2R which has a bigger mark whether or not this point
has already been deleted. The rate of this process for N = 2 is
(4.4.30)
(4.4.31)
k(t) =
(4.4.32)
Jr0 ,a i
where Jr0 ,a denotes the set of interfering transmitters such that r0 ri < a. Let P
be the rate of Poisson process producing a Matern process after thinning. The rate
of thinned process is
1 exp P r02
=
.
r02
Using the Campbell theorem we compute the MGF of Xr0 :
( ) = E e Xr0
= exp P (a
r02 )
0 1
q(t)dt
0
r0
2r
g(r)
dr 1 . (4.4.33)
e
(a2 r02 )
453
P
and q(t) = exp P r02t is the retaining probability of a point
r 0
1
q(t)dt = , we obtain
of mark t. Since
P
0
0 a
2r
2
2
g(r)
e
( ) = exp (a r0 )
dr 1 .
(4.4.34)
2
r0 (a2 r0 )
Here g(r) =
k =
0 a
r0
2r(g(r))k dr =
2 Pk
Pk
.
k 2 r0k 2 ak 2
(4.4.35)
Engineers say that outage happens at the central receiver, i.e. the interference prevents one from reading a signal obtained from a sender at distance rs , if
P/rs
k.
02 + Jr0 ,a P/ri
Here, 02 is the noise power, rs is the distance to sender and k is the minimal SIR
(signal/noise ratio) required for successful reception. Different approximations of
outage probability based on the moments computed in (4.4.35) are developed. Typically, the distribution of Xr0 is close to log-normal; see, e.g., [113].
454
for some function f : {0, 1}d {0, 1} (a feedback function). The initial string
(x0 , . . . , xd1 ) is called an initial fill; it produces an output stream (xn )n0 satisfying
the recurrence equation
xn+d = f (xn , . . . , xn+d1 ),
for all n 0.
(4.5.1)
A feedback shift register is said to be linear (an LFSR, for short) if function f is
linear and c0 = 1:
d1
f (x0 , . . . , xd1 ) =
ci xi ,
where ci = 0, 1, c0 = 1;
(4.5.2)
i=0
xn+d =
ci xn+i
for all n 0.
(4.5.3)
i=0
where
0
0
V = ...
1
0
..
.
0
1
..
.
...
...
..
.
0
0
..
.
0 0 ...
0
c0 c1 c2 . . . cd2
0
0
..
.
(4.5.4)
xn
xn+1
..
n+d1
=
,
x
. .
xn+d2
1
cd1
xn+d1
(4.5.5)
By the expansion of the determinant along the first column one can see that det V =
1 mod 2: the cofactor for the (n, 1) entry c0 is the matrix Id1 . Hence,
det V = c0 det Id1 = c0 = 1, and the matrix V is invertible.
(4.5.6)
(4.5.7)
Observe that general feedback shift registers, after an initial run, become periodic:
Theorem 4.5.2 The output stream (xn ) of a general feedback shift register of
length d has the property that there exists integer r, 0 r < 2d , and integer D,
1 D < 2d r, such that xk+D = xk for all k r.
455
Proof A segment xM . . . xM+d1 determines uniquely the rest of the output stream
in (4.5.1), i.e. (xn , n M + d 1). We see that if such a segment is reproduced in
the stream, it will be repeated. There are 2d different possibilities for a string of d
subsequent digits. Hence, by the pigeonhole principle, there exists 0 r < R < 2d
such that the two segments of length d of the output stream, from positions r and
R onwards, will be the same: xr+ j = xR+ j , 0 r < R < d. Then, as was noted,
xr+ j = xR+ j for all j 0, and the assertion holds true with D = R r.
In the linear case (LFSR), we can repeat the above argument, with the zero string
discarded. This allows us to reduce 2d to 2d 1. However, an LFSR is periodic in
a proper sense:
Theorem 4.5.3 An LFSR (xn ) is periodic, i.e. there exists D 2d 1 such that
xn+D = xn for all n. The smallest D with this property is called the period of the
LFSR.
Proof Indeed, the column vectors xn+d1
, n 0, are related by the equation
n
xn+1 = Vxn = Vn+1 x0 , n 0, where matrix V was defined in (4.5.5). We noted that
det V = c0 = 0 and hence V is invertible. As was said before, we may discard the
zero initial fill. For each vector xn {0, 1}d there are only 2d 1 non-zero possibilities. Therefore, as in the proof of Theorem 4.5.2, among the initial 2d 1 vectors
xn , 0 n 2d 2, either there will be repeats, or there will be a zero vector. The
second possibility can be again discarded, as it leads to the zero initial fill. Thus,
suppose that the first repeat was for j and D + j: x j = x j+D , i.e. V j+D x0 = V j x0 .
If j = 0, we multiply by V1 and arrive at an earlier repeat. So: j = 0, D 2d 1
and VD x0 = x0 . Then, obviously, xn+D = Vn+D x0 = Vn x0 = xn .
Worked Example 4.5.4 Give an example of a general feedback register with
output k j , and initial fill (k0 , k1 , . . . , kN ), such that
(kn , kn+1 , . . . , kn+N ) = (k0 , k1 , . . . , kN ) for all n 1.
Solution Take f : {0, 1}2 {0, 1}2 with f (x1 , x2 ) = x2 1. The initial fill 00 yields
00111111111 . . .. Here, kn+1 = 0 = k1 for all n 1.
Worked Example 4.5.5 Let matrix V be defined by (4.5.5), for the linear recursion (4.5.3). Define and compute the characteristic and minimal polynomials
for V.
456
X
0
0
c0
1
X
..
.
0
1
..
.
...
...
..
.
0
0
..
.
0
0
..
.
(4.5.8)
0 0 ... X
1
c1 c2 . . . cd2 (cd1 + X)
(recall, entries 1 and ci are considered in F2 ). Expanding along the bottom row,
the polynomial hV (t) is written as a linear combination of determinants of size
(d 1) (d 1) (co-factors):
1 0 ... 0 0
X 0 ... 0 0
X 1 . . . 0 0
X 1 . . . 0 0
c0 det . . .
+ c1 det . . .
.
.. ..
.
.
.
.
.
.
.
.
.
. .
. .
. . .
. . .
0
0 ... X
0 0 ... X 1
1 ... 0 0
X . . . 0 0
.. . . .. ..
. . .
.
0 0 ... 0 1
X 1 ... 0
0 X ... 0
+(cd1 + X) det . . .
. . ...
.. ..
1
X
0
+ + cd2 det .
..
0
0
..
.
... 0 X
457
( j)
( j)
a0 e j + a1 Ve j + + ad j Vd j e j + Vd j +1 e j = 0.
(iv) Further, we form the corresponding polynomial
( j)
mV (X) =
( j)
ai X i + X d j +1 .
0id j
(v) Then,
(1)
(d)
mV (X) = lcm mV (X), . . . , mV (X) .
0
..
.
ej =
1
..
.
0
..
.
j.
..
.
mV (X) =
ci X i + X d = hV (X).
0id1
We see that the feedback polynomial C(X) of the recursion coincides with the
characteristic and the minimal polynomial for V. Observe that at X = 0 we obtain
hV (0) = C(0) = c0 = 1 = det V.
(4.5.9)
Any polynomial can be identified through its roots; we saw that such a description may be extremely useful. In the case of an LFSR, the following example is
instructive.
Theorem 4.5.6 Consider the binary linear recurrence in (4.5.3) and the corresponding auxiliary polynomial C(X) from (4.5.7).
(a) Suppose K is a field containing F2 such that polynomial C(X) has a root of
multiplicity m in K. Then, for all k = 0, 1, . . . , m 1,
xn = A(n, k) n , n = 0, 1, . . . ,
(4.5.10)
458
k = 0,
1,
A(n, k) =
(n l)+ mod 2, k 1.
(4.5.11)
0lk1
Here, and below, (a)+ stands for max[a, 0]. In other words, sequence x(k) =
(xn ), where xn is given by (4.5.10), is an output of the LFSR with auxiliary
polynomial C(X).
(b) Suppose K is a field containing F2 such that C(X) factorises in K into linear factors. Let 1 , . . . , r K be distinct roots of C(X) of multiplicities
m1 , . . . , mr , with mi = d . Then the general solution of (4.5.3) in K is
1ir
xn =
1ir 0kmi 1
(4.5.12)
for some bu,v K. In other words, sequences x(i,k) = (xn ), where xn = A(n, k)in
and A(n, k) is given by (4.5.11), span the set of all output streams of the LFSR
with auxiliary polynomial C(X).
Proof (a) If C(X) has a root K of multiplicity m then C(X) = (X )mC(X)
where C(X) is a polynomial of degree d m (with coefficients from a field K K).
Then, for all k = 0, . . . , m 1, and for all n d, the polynomial
dk
Dk,n (X) := X k k X nd C(X)
dX
(with coefficients taken mod 2) vanishes at X = (in field K):
Dk,n ( ) =
0id1
This yields
A(n, k) n =
ci A(n d + i, k) nd+i .
0id1
of root .
(b) First, observe that the set of output streams (xn )n0 forms a linear space W over
K (in the set of all sequences with entries from K). The dimension of W equals d,
as every stream is uniquely defined by a seed (initial fill) x0 x1 . . . xd1 Kd . On the
(i,k)
other hand, d = mi , the total number of sequences x(i,k) = xn
with entries
1ir
(i,k)
xn
= A(n, k)in , n = 0, 1, . . . .
459
1ir 0kmi 1
x(i,k)
1ir 0kmi 1
br,mr 1 x(r,mr 1) . This gives
1i<r
460
Upon seeing a stream of digits (xn )n0 , an observer may wish to determine
whether it was produced by an LFSR. This can be done by using the so-called
BerlekampMassey (BM) algorithm, solving a system of linear equations. If a sed1
d1
x0
x1
Ad = .
..
x1
x2
..
.
x2
x3
..
.
xd xd+1 xd+2
. . . xd
. . . xd+1
.. ,
..
.
.
. . . x2d
c0
c1
..
.
cd =
.
cd1
1
(4.5.13)
x0
x1
Ar = .
..
xr
x1
x2
..
.
x2
x3
..
.
xr+1 xr+2
. . . xr
. . . xr+1
.. .
..
.
.
. . . x2r
x0 x1 . . . xd
x1 x2 . . . xd+1
Ad
= 0, where Ad = ..
..
..
..
.
.
.
.
ad1
xd xd+1 . . . x2d
1
a0
a1
..
.
(e.g. by Gaussian elimination) and test sequence (xn ) for the recursion xn+d =
ai xn+i . If we discover a discrepancy, we choose a different vector cr
0id1
(xn ), consider a formal power series in X: x j X j . The fact that (xn ) is produced
j=0
461
by the LFSR with a feedback polynomial C(X) is equivalent to the fact that the
d
A(X)
x j X j = C(X) .
(4.5.14)
j=0
an = ci xni , n = 1, . . . ,
(4.5.15)
i=1
or
n1
an ci xni ,
xn =
i=1
n1
ci xni ,
i=0
n = 0, 1, . . . , d,
(4.5.16)
n > d.
In other words, A(X) takes part in specifying the initial fill, and C(X) acts as the
feedback polynomial.
Worked Example 4.5.7 What is a linear feedback shift register? Explain the
BerlekampMassey method for recovering the feedback polynomial of a linear
feedback shift register from its output. Illustrate in the case when we observe outputs
1 0 1 0 1 1 0 0 1 0 0 0 ...,
0 1 0 1 1 1 1 0 0 0 1 0 ...
and
1 1 0 0 1 0 1 1.
Solution An initial fill x0 . . . xd1 produces an output stream (xn )n0 satisfying the
recurrence equation
d1
xn+d =
ci xn+i
for all n 0.
i=0
462
1 0 1
A2 = 0 1 0 , with det A2 = 0,
1 0 1
c0
1 0 1 0
0 1 0 1
A3 =
1 0 1 1 , with det A3 = 0,
0 1 1 0
and then to A4 :
1
0
A4 =
1
0
1
0
1
0
1
1
1
0
1
1
0
0
1
1
0
0
1
1
0
, with det A4 = 0.
0
1
1
0
0 1 0
0 1 0
0 1
1 0 1
det
= 0, det 1 0 1 = 0, det
0 1 1
1 0
0 1 1
1 1 1
1
1
= 0
1
1
and
0
1
1
1
1
0
1
1
1
0
1
1
1
1
1
1
1
1
0
463
1
1
1
1
1
0 = 0.
0 0
0
1
This yields the solution: d = 4, xn+4 = xn + xn+1 . The linear recurrence relation is
satisfied by every term of the output sequence given. The feedback polynomial is
then X 4 + X + 1.
In the third example the recursion is xn+3 = xn + xn+1 .
LFSRs are used for producing additive stream ciphers. Additive stream ciphers
were invented in 1917 by Gilbert Vernam, at the time an engineer with the AT&T
Bell Labs. Here, the sending party uses an output stream from an LFSR (kn ) to
encrypt a plain text (pn ) by (zn ) where
zn = pn + kn mod 2, n 0.
(4.5.17)
(4.5.18)
but of course he must know the initial fill k0 . . . kd1 and the string c0 . . . cd1 . The
main deficiency of the stream cipher is its periodicity. Indeed, if the generating
LFSR has period D then it is enough for an attacker to have in his possession a
cipher text z0 z1 . . . z2D1 and the corresponding plain text p0 p1 . . . p2D1 , of length
2D. (Not an unachievable task for a modern-day Sherlock Holmes.) If by some luck
the attacker knows the value of the period D then he only needs z0 z1 . . . zD1 and
p0 p1 . . . pD1 . This will allow the attacker to break the cipher, i.e. to decrypt the
whole plain text, however long.
Clearly, short-period LFSRs are easier to break when they are used repeatedly. The history of World War II and the subsequent Cold War has a number of
spectacular examples (German code-breakers succeeding in part in reading British
Navy codes, British and American code-breakers succeeding in breaking German
codes, the American project Venona deciphering Soviet codes) achieved because
of intensive message traffic. However, even ultra-long periods cannot guarantee
safety.
As far as this section of the book is concerned, the period of an LFSR can be
increased by combining several LFSRs.
Theorem 4.5.8 Suppose a stream (xn ) is produced by an LFSR of length d1 ,
period D1 and with an auxiliary polynomial C1 (X), and a stream (yn ) by an LFSR
464
1ir2
1ir1 0kmi 1
1 jr2 0lm j 1
b j,l A(n, l) jn ,
(4.5.19)
The
can
be
written
as
a
sum
product
ai,k b j,l A(n, k)A(n, l)
A(n,t)ut (ai,k , b j,l ) where coefficients ut (ai,k , b j,l ) K. This gives
kltk+l1
A(n,t)
k,l:kltk+l1
A(n,t)vi, j;t (i j )n
corresponding to the generic form of the output stream for the LFSR with the auxiliary polynomial C(X) in statement (b).
Despite serious drawbacks, LFSRs remain in use in a variety of situations: they
allow simple enciphering and deciphering without lookahead and display a local effect of an error, be it encoding, transmission or decoding. More generally,
465
non-linear LFSRs often offer only marginal advantages while bringing serious disadvantages, in particular with deciphering.
(in F2 ).
466
All 16 initial fills have appeared in the list, so the analysis is complete.
Worked Example 4.5.11 Describe how an additive stream cipher operates. What
is a one-time pad? Explain briefly why a one-time pad is safe if used only once but
becomes unsafe if used many times. A one-time pad is used to send the message
x1 x2 x3 x4 x5 x6 y7 which is encoded as 0101011. By mistake, it is reused to send the
message y0 x1 x2 x3 x4 x5 x6 which is encoded as 0100010. Show that x1 x2 x3 x4 x5 x6 is
one of two possible messages, and find the two possibilities.
Solution A one-time pad is an example of a cipher based on a random key and
proposed by Gilbert Vernam and Joseph Mauborgne (the Chief of the USA Signal
Corps during World War II). The cipher uses a random number generator producing
a sequence k1 k2 k3 . . . from the alphabet J of size q. More precisely, each letter
is uniformly distributed over J and different letters are independent. A message
m = a1 a2 . . . an is encrypted as c = c1 c2 . . . cn where
ci = ai + ki mod q .
To show that the one-time pad achieves perfect secrecy, write
P(M = m,C = c) = P(M = m, K = c m)
= P(M = m)P(K = c m) = P(M = m)
1
;
qn
here the subtraction c m is digit-wise and mod q. Hence, the conditional probability
P(C = c|M = m) =
1
P(M = m,C = c)
= n
P(M = m)
q
467
the cipher key stream is known only to the sender and the recipient.) In the example,
we have
x1 x2 x3 x4 x5 x6 y7 0101011,
y0 x1 x2 x3 x4 x5 x6 0100010.
Suppose x1 = 0. Then
k0 = 0, k1 = 1, x2 = 0, k2 = 0, x3 = 0, k3 = 0, x4 = 1, k1 = 0,
x5 = 1, k5 = 0, x6 = 1, k6 = 1.
Thus,
k = 0100101, x = 000111.
If x1 = 1, every digit changes, so
k = 1011010, x = 111000.
Alternatively, set x0 = y0 and x7 = y7 . If the first cipher is q1 q2 . . ., the second is
p1 p2 . . . and the one-time pad is k1 , k2 , . . ., then
q j = x j+1 + k j , p j = x j + k j .
So,
x j + x j+1 = q j + p j ,
and
x1 + x2 = 0, x2 + x3 = 0,
x3 + x4 = 1, x4 + x5 = 0, x5 + x6 = 0.
This yields
x1 = x2 = x3 , x4 = x5 = x6 , x4 = x3 + 1.
The message is 000111 or 111000.
Worked Example 4.5.12
(a) Let : Z+ {0, 1} be given by (n) = 1 if n is
odd, (n) = 0 if n is even. Consider the following recurrence relation over F2 :
un+3 + un+2 + un+1 + un = 0.
(4.5.20)
468
469
such that
(f) for all key e K there is a key d K , with the property that Dd (Ee (P)) = P
for all plaintext P P.
Example 4.5.14 Suppose that two parties, Bob and Alice, intend to have a twoside private communication. They want to exchange their keys, EA and EB , by using an insecure binary channel. An obvious protocol is as follows. Alice encrypts
a plain-text m as EA (m) and sends it to Bob. He encrypts it as EB (EA (m)) and
returns it to Alice. Now we make a crucial assumption that EA and EB commute
for any plaintext m : EA EB (m ) = EB EA (m ). In this case Alice can decrypt
this message as DA (EA (EB (m))) = EB (m) and send this to Bob, who then calculates DB (EB (m)) = m. Under this protocol, at no time during the transaction is an
unencrypted message transmitted.
However, a further thought shows that this is no solution at all. Indeed, suppose
that Alice uses a one-time pad kA and Bob uses a one-time pad kB . Then any single interception provides no information about plaintext m. However, if all three
transmissions are intercepted, it is enough to take the sum
(m + kA ) + (m + kA + kB ) + (m + kB ) = m
to obtain the plaintext m. So, more sophisticated protocols should be developed:
this is where public key cryptosystems are helpful.
Another popular example is a network of investors and brokers dealing in a
market and using an open access cryptosystem such as RSA. An investors concern
is that a broker will buy shares without her authorisation and, in the case of a loss,
claim that he had a written request from the client. Indeed, it is easy for a broker
to generate a coded order requesting to buy the stocks as the encoding key is in the
public domain. On the other hand, a broker may be concerned that if he buys the
shares by the investors request and the market goes down, the investor may claim
that she never ordered this transaction and that her coded request is a fake.
However, it is easy to develop a protocol which addresses these concerns. An
investor Alice sends to a broker Bob, together with her request p to buy shares, her
electronic signature fB fA1 (p). After receiving this message Bob sends a receipt
r encoded as fA fB1 (r). If a conflict emerges, both sides can provide a third party
(say, a court) with these coded messages and the keys. Since no-one but Alice could
generate the message coded by fB fA1 and no-one but Bob could generate the message coded by fA fB1 , no doubts would remain. This is the gist of bit commitment.
The above-mentioned RSA (RivestShamirAdelman) scheme is a prime example
470
(4.5.21)
Number N is often called the RSA modulus (and made public). The value of the
Euler totient function is
(4.5.22)
(4.5.23)
[The value of d can be computed via the extended Euclids algorithm.] The public
key eB used for encryption is the pair (N, l) (listed in the public directory). The
sender (Alice), when communicating to Bob, understands that Bobs plaintext and
ciphertext sets are P = C = {1, . . . , N 1}. She then encrypts her chosen plaintext
m = 1, . . . , N 1 as the ciphertext
EN,l (m) = c where c = ml mod N.
(4.5.24)
Bobs private key dB is the pair (N, d) (or simply number d): it is kept secret
from public but made known to Bob. The recipient decrypts ciphertext c as
Dd (c) = cd mod N.
(4.5.25)
In the literature, l is often called the encryption and d the decryption exponent.
Theorem 4.5.15 below guarantees that
Dd (c) = mdl = m mod N,
(4.5.26)
471
(4.5.27)
Otherwise, i.e. when p|m, (4.5.27) still holds as m and (ml )d are both equal to
0 mod p. By a similar argument,
(ml )d = m mod q.
(4.5.28)
By the Chinese remainder theorem (CRT) [28], [114] (4.5.27) and (4.5.28)
imply (4.5.26).
Example 4.5.16 Suppose Bob has chosen p = 29, q = 31, with N = 899 and
(N) = 840. The smallest possible value of e with gcd(l, (N)) = 1 is l = 11, after
that 13 followed by 17, and so on. The (extended) Euclid algorithm yields d = 611
for l = 11, d = 517 for l = 13, and so on. In the first case, the encrypting key E899,11
is
m m11 mod 899, that is, E899,11 (2) = 250.
The ciphertext 250 is decoded by
D611 (250) = 250611 mod 899 = 2,
with the help of the computer. [The computer is needed even after the simplification
rendered by the use of the CRT. For instance, the command in Mathematica is
PowerMod[250,611,899].]
Worked Example 4.5.17 (a) Referring to the RSA cryptosystem with public key
(N, l) and private key ( (N), d), discuss possible advantages or disadvantages of
taking (i) l = 232 + 1 or (ii) d = 232 + 1.
(b) Let a (large) number N be given, and we know that N is a product of two distinct
prime numbers, N = pq, but we do not know the numbers p and q. Assume that
another positive integer, m, is given, which is a multiple of (N). Explain how to
find p and q.
(c) Describe how to solve the bit commitment problem by means of the RSA.
Solution Using l = 232 + 1 provides fast encryption (you need just 33 multiplications using repeated squaring). With d = 232 + 1 one can decrypt messages quickly
(but an attacker can easily guess it).
472
(b) Next, we show that if we know a multiple m of (N) then it is easy to factor N.
Given positive integers y > 1 and M > 1, denote by ordM (y) the order of y relative
to M:
(4.5.29)
y2 1 mod p, y2 1 mod q.
t
So, gcd y2 1, N = p, as required.
473
has size (p 1)/2. We will do this by exhibiting such a subset of size (p 1)/2.
Note that
(4.5.30)
Furthermore:
Alices public key is N = pq; her secret key is the pair (p, q);
Alices plaintext and ciphertext are numbers m = 0, 1 . . . , N 1,
and her encryption rule is EN (m) = c where c = m2 mod N.
(4.5.31)
(4.5.32)
Then
m p = c1/2 mod p and mq = c1/2 mod q,
i.e. m p and mq are the square roots of c mod p and mod q, respectively. In fact,
2
p1
c = c mod p;
m p = c(p+1)/2 = c(p1)/2 c = m p
at the last step the EulerFermat theorem has been used. The argument for mq is
similar. Then Alice computes, via Euclids algorithm, integers u(p) and v(q) such
that
u(p)p + v(q)q = 1.
474
s = u(p)pmq v(q)qm p ] mod N.
These are four square roots of c mod N. The plaintext m is one of them. To
secure that she can identify the original plaintext, Alice may reduce the plaintext
space P, allowing only plaintexts with some special features (like the property
that their first 32 and last 32 digits are repetitions of each other), so that it becomes
unlikely that more than one square root has this feature. However, such a measure
may result in a reduced difficulty of breaking the cipher as it will be not always
true that the reduced problem is equivalent to factoring.
(4.5.33)
475
Then is called the discrete logarithm, mod p, of b to base ; some authors write
= dlog b mod p. Computing discrete logarithms is considered a difficult problem: no efficient (polynomial) algorithm is known, although there is no proof that
it is indeed a non-polynomial problem. [In an additive cyclic group Z/(nZ), the
DLP becomes b = mod n and is solved by the Euclid algorithm.]
The DiffieHellman protocol allows Alice and Bob to establish a common secret
key using field tables for F p , for a sufficient quantity of prime numbers p. That is,
they know a primitive element in each of these fields. They agree to fix a large
prime number p and a primitive element F p . The pair (p, ) may be publicly
known: Alice and Bob can fix p and through the insecure channel.
Next, Alice chooses a {0, 1, . . . , p 2} at random, computes
A = a mod p
and sends A to Bob, keeping a secret. Symmetrically, Bob chooses b {0, 1, . . . ,
p 2} at random, computes
B = b mod p
and sends B to Alice keeping b secret. Then
Alice computes Ba mod p and Bob computes Ab mod p,
and their secret key is the common value
K = ab = Ba = Ab mod p.
The attacker may intercept p, , A and B but knows
neither a = dlog A mod p nor b = dlog B mod p.
If the attacker can find discrete logarithms mod p then he can break the secret
key: this is the only known way to do so. The opposite question solving the
discrete logarithm problem if he is able to break the protocol remains open (it is
considered an important problem in public key cryptography).
However, like previously discussed schemes, the DiffieHellman protocol has a
particular weak point: it is vulnerable to the man in the middle attack. Here, the
attacker uses the fact that neither Alice nor Bob can verify that a given message really comes from the opposite party and not from a third party. Suppose the attacker
can intercept all messages between Alice and Bob. Suppose he can impersonate
Bob and exchange keys with Alice pretending to be Bob and at the same time impersonate Alice and exchange keys with Bob pretending to be Alice. It is necessary
to use electronic signatures to distinguish this forgery.
476
477
and Alices public key is (p = 37, = 2, A = 26), her plaintexts are 0, 1, . . . , 36 and
private key a = 12. Assume Bob has chosen b = 32; then
B = 232 mod 37 = 4.
Suppose Bob wants to send m = 31. He encrypts m by
c = Ab m mod p = (26)32 m mod 37 = 10 31 mod 37 = 14.
Alice decodes this message as 232 = 7 and 724 = 26 mod 37,
14 232(37121) mod 37 = 14 724 = 14 26 mod 37 = 31.
Worked Example 4.5.21 Suppose that Alice wants to send the message today
to Bob using the ElGamal encryption. Describe how she does this using the prime
p = 15485863, = 6 a primitive root mod p, and her choice of b = 69. Assume
that Bob has private key a = 5. How does Bob recover the message using the
Mathematica program?
Solution Bob has public key (15485863, 6, 7776), which Alice obtains. She converts the English plaintext using the alphabet order to the numerical equivalent:
19, 14, 3, 0, 24. Since 265 < p < 266 , she can represent the plaintext message as a
single 5-digit base 26 integer:
m = 19 264 + 14 263 + 3 262 + 0 26 + 24 = 8930660.
Now she computes b = 669 = 13733130 mod 15485863, then
m ab = 8930660 777669 = 4578170 mod 15485863.
Alice sends c = (13733130, 4578170) to Bob. He uses his private key to compute
( b ) p1a = 137331301548586315 = 2620662 mod 15485863
and
( )a m ab = 2620662 4578170 = 8930660 mod 15485863,
and converts the message back to the English plaintext.
Worked Example 4.5.22 (a) Describe the RabinWilliams scheme for coding
a message x as x2 modulo a certain N . Show that, if N is chosen appropriately,
breaking this code is equivalent to factorising the product of two primes.
(b) Describe the RSA system associated with a public key e, a private key d and
the product N of two large primes.
Give a simple example of how the system is vulnerable to a homomorphism
attack. Explain how a signature system prevents such an attack. Explain how to
factorise N when e, d and N are known.
478
Solution (a) Fix two large primes p, q 1 mod 4 which forms a private key; the
broadcasted public key is the product N = pq. The properties used are:
(i) If p is a prime, the congruence a2 d mod p has at most two solutions.
(ii) For a prime p = 1 mod 4, i.e. p = 4k 1, if the congruence a2 c mod p has
a solution then a c(p+1)/4 mod p is one solution and a c(p+1)/4 mod p is
another solution. [Indeed, if c a2 mod p then, by the EulerFermat theorem,
c2k = a4k = a(p1)+2 = a2 mod p, implying ck = a.]
The message is a number m from M = {0, 1, . . . , N 1}. The encrypter (Bob) sends
(broadcasts) m = m2 mod N. The decrypter (Alice) uses property (ii) to recover the
two possible values of m mod p and two possible values of m mod q. The CRT then
yields four possible values for m: three of them would be incorrect and one correct.
So, if one can factorise N then the code would be broken. Conversely, suppose
that we can break the code. Then we can find all four distinct square roots u1 , u2 ,
u3 , u4 mod N for a general u. (The CRT plus property (i) shows that u has zero
or four square roots unless it is a multiple of p and q.) Then u j u1 (calculable via
Euclids algorithm) gives rise to the four square roots, 1, 1, 1 and 2 , of 1 mod N,
with
1 1 mod p, 1 1 mod q
and
2 1 mod p, 2 1 mod q.
By interchanging p and q, if necessary, we may suppose we know 1 . As 1 1
is divisible by p and not by q, the gcd(1 1, N) = p; that is, p can be found by
Euclids algorithm. Then q can also be identified.
In practice, it can be done as follows. Assuming that we can find square roots
mod N, we pick x at random and solve the congruence x2 y2 mod N. With probability 1/2, we have x y mod N. Then gcd(x y, N) is a non-trivial factor of N.
We repeat the procedure until we identify a factor; after k trials the probability of
success is 1 2k .
(b) To define the RSA cryptosystem let us randomly choose large primes p and q.
By Fermats little theorem,
x p1 1 mod p, xq1 1 mod q.
Thus, by writing N = pq and (N) = lcm(p 1, q 1), we have
x (N) 1 mod N,
for all integers x coprime to N.
479
Next, we choose e randomly. Either Euclids algorithm will reveal that e is not
co-prime to (N) or we can use Euclids algorithm to find d such that
de 1 mod (N).
With a very high probability a few trials will give appropriate d and e.
We now give out the value e of the public key and the value of N but keep secret
the private key d. Given a message m with 1 m N 1, it is encoded as the
integer c with
1 c N 1 and c me mod N.
Unless m is not co-prime to N (an event of negligible probability), we can decode
by observing that
m mde cd mod N.
As an example of a homomorphism attack, suppose the system is used to transmit a number m (dollars to be paid) and someone knowing this replaces the coded
message c by c2 . Then
(c2 )d m2de m2
and the recipient of the (falsified) message believes that m2 dollars are to be paid.
Suppose that a signature B(m) is also encoded and transmitted, where B is a
many-to-one function with no simple algebraic properties. Then the attack above
will produce a message and signature which do not correspond, and the recipient
will know that the message was tampered with.
Suppose e, d and N are known. Since
de 1 0 mod (N)
and (N) is even, de 1 is even. Thus de 1 = 2a b with b odd and a 1.
Choose x at random. Set z xb mod N. By the CRT, z is a square root of
1 mod N = pq if and only if it is a square root of 1 mod p and q. As F2 is a field,
x2 1 mod p (x 1)(x + 1) 0 mod p
(x 1) 0 mod p or (x + 1) 0 mod p.
Thus 1 has four square roots w mod N satisfying w 1 mod p and w
1 mod q. In other words,
w 1 mod N, w 1 mod N,
w w1 mod N with w1 1 mod p and w1 1 mod q
or
w w2 (mod N) with w2 1 mod p and w1 1 mod q.
480
P(T < t) = (1 ep j t ) .
(4.6.1)
j=1
Let L = NT be the total number of coupons collected by the time the complete
set of coupon types is obtained. Show that ET = EL. Hence, or otherwise, deduce
that EL does not depend on .
Solution Part (a) directly follows from the definition of a Poisson process.
(b) Let T j be the time of the first collection of a type j coupon. Then T j Exp(p j ),
independently for different j. We have
T = max T1 , . . . , Tm ,
and hence
m
m
P(T < t) = P max T1 , . . . , Tm < t = P(T j < t) = 1 ep j t .
j=1
j=1
481
Next, observe that the random variable L counts the jumps in the original Poisson
process (Nt ) until the time of collecting a complete set of coupon types. That is:
L
T = Si ,
i=1
where S1 , S2 , . . . are the holding times in (Nt ), with S j Exp( ), independently for
different j. Then
E(T |L = n) = nES1 = n 1 .
Moreover, L is independent of the random variables S1 , S2 , . . .. Thus,
ET = P(L = n)E T |L = n = ES1 nP(L = n) = 1 EL.
nm
nm
But
ET =
=
0
j=1
1 1 ep j t dt,
m
j=1
P(Nt i),
t 0,
i = 1, . . . , k,
(4.6.2)
482
Suppose now that arrivals to the first queue stop at time T . Determine the mean
number of customers at the ith queue at each time t T .
Solution We apply the product theorem to the Poisson process of arrivals with
random vectors Yn = (Sn1 , . . . , Snk ) where Sni is the service time of the nth customer
at the ith queue. Then
Vi (t) = the number of customers in the ith queue at time t
= 1 the nth customer arrived in the first queue at
n=1
n=1
n=1
Here (Jn : n N) denote the jump times of a Poisson process of rate , and the
measures M and on (0, ) Rk+ are defined by
M(A) =
(Jn ,Yn ) A , A (0, ) Rk+
n=1
and
(0,t] B = t (B).
The product theorem states that M is a Poisson random measure on (0, ) Rk+
with intensity measure . Next, the set Ai (t) (0, ) Rk+ is defined by
*
Ai (t) = ( , s1 , . . . , sk ) : 0 < < t, s1 , . . . , sk 0
+
and + s1 + + si1 t < + s1 + + si
'
=
( , s1 , . . . , sk ) : 0 < < t, s1 , . . . , sk 0
and
i1
l=1
l=1
sl t < sl
@
.
Sets Ai (t) are pairwise disjoint for i = 1, . . . , k (as t can fall between subsei1
l=1
l=1
quent partial sums sl and sl only once). So, the random variables Vi (t) are
independent Poisson.
483
A direct verification is through the joint MGF. Namely, let $N_t\sim{\rm Po}(\lambda t)$ be the number of arrivals at the first queue by time $t$. Then write
$$M_{V_1(t),\dots,V_k(t)}(\theta_1,\dots,\theta_k) = \mathbb E\exp\big(\theta_1V_1(t)+\dots+\theta_kV_k(t)\big) = \mathbb E\Big[\mathbb E\Big(\exp\Big(\sum_{i=1}^{k}\theta_iV_i(t)\Big)\,\Big|\,N_t;\,J_1,\dots,J_{N_t}\Big)\Big].$$
In turn, given $n=1,2,\dots$ and points $0<\tau_1<\dots<\tau_n<t$, the conditional expectation is
$$\mathbb E\Big(\exp\Big(\sum_{i=1}^{k}\theta_iV_i(t)\Big)\,\Big|\,N_t=n;\,J_1=\tau_1,\dots,J_n=\tau_n\Big) = \mathbb E\exp\Big(\sum_{i=1}^{k}\theta_i\sum_{j=1}^{n}\mathbf 1\big((\tau_j,(S_j^1,\dots,S_j^k))\in A_i(t)\big)\Big)$$
$$= \mathbb E\exp\Big(\sum_{j=1}^{n}\sum_{i=1}^{k}\theta_i\,\mathbf 1\big((\tau_j,(S_j^1,\dots,S_j^k))\in A_i(t)\big)\Big) = \prod_{j=1}^{n}\mathbb E\exp\Big(\sum_{i=1}^{k}\theta_i\,\mathbf 1\big((\tau_j,(S_j^1,\dots,S_j^k))\in A_i(t)\big)\Big).$$
Averaging over the jump times (which, conditional on $N_t=n$, are distributed as the order statistics of $n$ points chosen independently and uniformly on $(0,t)$) and summing over $n$ turns this into
$$\prod_{i=1}^{k}\exp\big(\Lambda_i(t)\,(e^{\theta_i}-1)\big),\qquad \Lambda_i(t) = \lambda\int_0^t P\Big(\sum_{l=1}^{i-1}S_l\le t-\tau<\sum_{l=1}^{i}S_l\Big)d\tau,$$
the joint MGF of independent Poisson random variables. By the uniqueness of a random variable with a given MGF, this implies that
$$V_i(t)\sim{\rm Po}\Big(\lambda\int_0^t P\Big(\sum_{l=1}^{i-1}S_l\le t-\tau<\sum_{l=1}^{i}S_l\Big)d\tau\Big),\quad\text{independently.}$$
In particular,
$$\mathbb E V_i(t) = \Lambda_i(t) = \lambda\int_0^t P\Big(\sum_{l=1}^{i-1}S_l\le s<\sum_{l=1}^{i}S_l\Big)ds = \lambda\,\mathbb E\int_0^t\mathbf 1(N_s=i-1)\,ds = P(N_t\ge i),$$
where $(N_s)$ denotes the Poisson process of rate $\lambda$ whose holding times are the (exponential) service times $S_1,S_2,\dots$, as in (4.6.2).
Finally, write $V_i(t,T)$ for the number of customers in queue $i$ at time $t$ after closing the entrance at time $T$. Then
$$\mathbb E V_i(t,T) = \lambda\int_0^T P\big(N_{t-\tau}=i-1\big)d\tau = \lambda\,\mathbb E\int_{t-T}^{t}\mathbf 1(N_s=i-1)\,ds = P(N_t\ge i)-P(N_{t-T}\ge i).$$
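The identity $\mathbb E V_i(t)=P(N_t\ge i)$ is easy to check numerically. The sketch below assumes i.i.d. exponential service times with the same rate $\lambda$ as the arrivals (the setting used in the display above); the parameter values and helper names are our own.

```python
import math
import random

def tandem_counts(lam, k, t, rng):
    """One realisation of (V_1(t),...,V_k(t)): Poisson(lam) arrivals on [0, t],
    each customer receiving i.i.d. Exp(lam) service at each of the k queues."""
    V = [0] * k
    s = rng.expovariate(lam)                  # first arrival time
    while s < t:
        done = s
        for i in range(k):
            done += rng.expovariate(lam)      # service time at queue i+1
            if done > t:                      # still being served there at time t
                V[i] += 1
                break
        s += rng.expovariate(lam)             # next arrival
    return V

def poisson_tail(mu, i):                      # P(Po(mu) >= i)
    return 1.0 - sum(math.exp(-mu) * mu**j / math.factorial(j) for j in range(i))

rng = random.Random(0)
lam, k, t, runs = 1.0, 3, 2.0, 20000
totals = [0] * k
for _ in range(runs):
    V = tandem_counts(lam, k, t, rng)
    for i in range(k):
        totals[i] += V[i]
for i in range(k):
    print(f"queue {i+1}: mean V = {totals[i]/runs:.3f},  P(N_t >= {i+1}) = {poisson_tail(lam*t, i+1):.3f}")
```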
(ii) Let $N_T^{\rm ch}$ be the number of checkouts used at time $T$. By the product theorem, 4.4.11, $N_T^{\rm ch}\sim{\rm Po}(\Lambda(T))$, where
$$\Lambda(T) = \lambda\int_0^T\!du\int_0^\infty f(s,u)\,\mathbf 1\big(u+s<T<u+s+g(s)\big)\,ds.$$
For a direct verification, write
$$N_T^{\rm ch} = \sum_{i=1}^{N_T^{\rm arr}}\mathbf 1\big(J_i+S_i<T<J_i+S_i+g(S_i)\big),$$
where $N_T^{\rm arr}$ is the number of arrivals by time $T$. Then
$$\mathbb E\exp\Big(\theta\sum_{i=1}^{N_T^{\rm arr}}\mathbf 1\big(J_i+S_i<T<J_i+S_i+g(S_i)\big)\Big)
= e^{-\lambda T}\sum_{k\ge 0}\frac{\lambda^k}{k!}\int_{[0,T]^k}\mathbb E\exp\Big(\theta\sum_{i=1}^{k}\mathbf 1\big(t_i+S_i<T<t_i+S_i+g(S_i)\big)\Big)dt_1\cdots dt_k$$
$$= e^{-\lambda T}\sum_{k\ge 0}\frac{\lambda^k}{k!}\Big(\int_0^T\mathbb E\exp\Big(\theta\,\mathbf 1\big(t+S<T<t+S+g(S)\big)\Big)dt\Big)^k$$
$$= \exp\Big(\lambda\int_0^T\Big[\mathbb E\exp\Big(\theta\,\mathbf 1\big(t+S<T<t+S+g(S)\big)\Big)-1\Big]dt\Big)
= \exp\Big(\lambda\,(e^{\theta}-1)\int_0^T\!\!\int_0^\infty f(s,u)\,\mathbf 1\big(u+s<T<u+s+g(s)\big)\,ds\,du\Big),$$
which is the MGF of a Poisson random variable with mean $\Lambda(T)$, as claimed.
measure with intensity $\mu\big([0,t]\times[0,y]\big) = \lambda t F(y)$, where $F(y)=\int_0^y h(x)\,dx$ (the time $t=0$ corresponds to 9am).
(a) Now, the number of students leaving the library between 3pm and 4pm (i.e. $6\le t\le 7$) has a Poisson distribution ${\rm Po}(\Lambda(A))$ where $A = \{(r,s):\ s\in[0,7],\ r\in[6-s,7-s] \text{ if } s\le 6;\ r\in[0,7-s] \text{ if } s>6\}$. Here
$$\Lambda(A) = \lambda\int_0^8 dF(r)\int_{(6-r)_+}^{(7-r)_+}ds = \lambda\int_0^8\big[(7-r)_+-(6-r)_+\big]\,dF(r).$$
So, the number of students leaving the library between 3pm and 4pm is Poisson with mean $\Lambda(A)$.
(b) Here
$$(7-y)_+-(6-y)_+ = \begin{cases} 0, & \text{if } y\ge 7,\\ 7-y, & \text{if } 6\le y\le 7,\\ 1, & \text{if } y\le 6.\end{cases}$$
The mean number of students leaving the library between 3pm and 4pm is
$$\Lambda(A) = \lambda\int_0^8\big[(7-y)_+-(6-y)_+\big]\,dF(y),$$
as required.
(c) For students still to be there at closing time we require $J+H\ge 8$, as $H$ ranges over $[0,8]$ and $J$ ranges over $[8-H,8]$. Let
$$B = \{(t,x):\ t\in[0,8],\ x\in[8-t,8]\}.$$
So,
$$\Lambda(B) = \lambda\int_0^8 dt\int_{8-t}^{8}dF(x) = \lambda\Big[8-\int_0^8(8-x)\,dF(x)\Big] = \lambda\Big[8-8\int_0^8 dF(x)+\int_0^8 x\,dF(x)\Big];$$
but $\int_0^8 dF(x)=1$ and $\int_0^8 x\,dF(x)=\mathbb E[H]=1$ imply $\int_0^8(8-x)\,dF(x)=7$. Hence, the expected number of students in the library at closing time is $(8-7)\lambda = \lambda$.
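For a concrete check of part (c), the following sketch simulates one day of arrivals and counts the students still present at closing time. The stay-time density is our choice ($H\sim{\rm U}(0,2)$, so that $\mathbb E H=1$ as above), and the arrival rate is arbitrary.

```python
import random

def students_at_closing(lam, rng, close=8.0):
    """One day: Poisson(lam) arrivals per hour on [0, close]; each student stays
    H hours, here H ~ U(0, 2) so that E[H] = 1 (this particular density is an
    assumption of the sketch, not part of the problem)."""
    count, s = 0, rng.expovariate(lam)
    while s < close:
        if s + rng.uniform(0.0, 2.0) > close:   # still in the library at closing
            count += 1
        s += rng.expovariate(lam)
    return count

rng = random.Random(0)
lam, days = 5.0, 20000
mean = sum(students_at_closing(lam, rng) for _ in range(days)) / days
print(f"simulated mean at closing: {mean:.3f}   predicted (8 - 7)*lam = {lam:.3f}")
```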
Problem 4.5 (i) Prove Campbell's theorem, i.e. show that if $M$ is a Poisson random measure on the state space $E$ with intensity measure $\mu$ and $a: E\to\mathbb R$ is a bounded measurable function, then
$$\mathbb E\big[e^{\theta X}\big] = \exp\Big(\int_E\big(e^{\theta a(y)}-1\big)\,\mu(dy)\Big),\qquad\text{where } X = \int_E a(y)\,M(dy). \qquad(4.6.3)$$
(ii) Shots are heard at jump times $J_1,J_2,\dots$ of a Poisson process with rate $\lambda$. The initial amplitudes of the gunshots $A_1,A_2,\dots\sim{\rm Exp}(2)$ are IID exponentially distributed with parameter $2$, and the amplitudes decay linearly at rate $\alpha$. Compute the MGF of the total amplitude $X_t$ at time $t$:
$$X_t = \sum_n A_n\big(1-\alpha(t-J_n)\big)_+\mathbf 1(J_n\le t);$$
here $x_+ = x$ if $x\ge 0$ and $0$ otherwise.
Solution (i) Conditional on $M(E)=n$, the $n$ points $Y_1,\dots,Y_n$ of $M$ are IID with common distribution $\mu(\cdot)/\mu(E)$, so
$$\mathbb E\Big[e^{\theta\sum_{k=1}^{n}a(Y_k)}\,\Big|\,M(E)=n\Big] = \Big(\int_E e^{\theta a(y)}\,\mu(dy)\big/\mu(E)\Big)^{n}.$$
Hence,
$$\mathbb E\big[e^{\theta X}\big] = \sum_n\mathbb E\big[e^{\theta X}\,\big|\,M(E)=n\big]\,P(M(E)=n) = \sum_n\Big(\int_E e^{\theta a(y)}\,\mu(dy)\big/\mu(E)\Big)^{n}e^{-\mu(E)}\frac{\mu(E)^n}{n!} = \exp\Big(\int_E\big(e^{\theta a(y)}-1\big)\,\mu(dy)\Big).$$
(ii) Fix $t$ and let $E = [0,t]\times\mathbb R_+$, and let $\mu$ and $M$ be such that $\mu(ds,dx) = 2\lambda e^{-2x}\,ds\,dx$, $M(B) = \sum_n\mathbf 1\{(J_n,A_n)\in B\}$. By the product theorem $M$ is a Poisson random measure with intensity measure $\mu$. Set $a_t(s,x) = x\big(1-\alpha(t-s)\big)_+$; then $X_t = \int a_t(s,x)\,M(ds,dx)$. So, by Campbell's theorem, for $\theta<2$,
$$\mathbb E\big[e^{\theta X_t}\big] = \exp\Big(2\lambda\int_0^t\!\!\int_0^\infty\Big(e^{\theta x(1-\alpha(t-s))_+}-1\Big)e^{-2x}\,dx\,ds\Big) = \exp\Big(\lambda\int_0^t\frac{\theta\big(1-\alpha(t-s)\big)_+}{2-\theta\big(1-\alpha(t-s)\big)_+}\,ds\Big).$$
Splitting the integral at $s = t-1/\alpha$ (below which the integrand vanishes) and integrating, we obtain
$$\mathbb E\big[e^{\theta X_t}\big] = e^{-\lambda\min[t,1/\alpha]}\left(\frac{2-\theta(1-\alpha t)_+}{2-\theta}\right)^{2\lambda/(\alpha\theta)};$$
in the case $t>1/\alpha$ this equals $e^{-\lambda/\alpha}\big(2/(2-\theta)\big)^{2\lambda/(\alpha\theta)}$.
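The closed form just obtained can be compared with an empirical MGF. In the sketch below the rate $\lambda$, decay rate $\alpha$, horizon $t$ and argument $\theta<2$ are arbitrary choices, and `mgf_formula` implements the expression derived above.

```python
import math
import random

def shot_noise(lam, alpha, t, rng):
    """One sample of X_t = sum_n A_n (1 - alpha*(t - J_n))_+ over arrivals J_n <= t,
    with amplitudes A_n ~ Exp(2) (rate 2)."""
    x, s = 0.0, rng.expovariate(lam)
    while s < t:
        x += rng.expovariate(2.0) * max(0.0, 1.0 - alpha * (t - s))
        s += rng.expovariate(lam)
    return x

def mgf_formula(theta, lam, alpha, t):
    """Closed form derived above (valid for theta < 2)."""
    c0 = max(0.0, 1.0 - alpha * t)
    return (math.exp(-lam * min(t, 1.0 / alpha))
            * ((2.0 - theta * c0) / (2.0 - theta)) ** (2.0 * lam / (alpha * theta)))

rng = random.Random(0)
lam, alpha, t, theta, runs = 1.5, 0.8, 3.0, 1.0, 200000
emp = sum(math.exp(theta * shot_noise(lam, alpha, t, rng)) for _ in range(runs)) / runs
print(f"empirical E[exp(theta*X_t)] = {emp:.4f},  formula = {mgf_formula(theta, lam, alpha, t):.4f}")
```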
Problem 4.6 Seeds are planted in a field $S\subseteq\mathbb R^2$. The random way they are sown means that they form a Poisson process $\Pi$ on $S$ with density $\lambda(x,y)$. The seeds grow into plants that are later harvested as a crop, and the weight of the plant at $(x,y)$ has mean $m(x,y)$ and variance $v(x,y)$. The weights of different plants are independent random variables. Show that the total weight $W$ of all the plants is a random variable with finite mean
$$I_1 = \iint_S m(x,y)\,\lambda(x,y)\,dx\,dy$$
and variance
$$I_2 = \iint_S\big[m(x,y)^2+v(x,y)\big]\,\lambda(x,y)\,dx\,dy.$$
Solution Suppose first that
$$\Lambda = \iint_S\lambda(x,y)\,dx\,dy$$
is finite. Then the number $N$ of plants is finite and has the distribution ${\rm Po}(\Lambda)$.
Conditional on $N$, their positions may be taken as independent random variables $(X_n,Y_n)$, $n=1,\dots,N$, with density $\lambda/\Lambda$ on $S$. The weights $W_n$ of the plants are then independent, with
$$\mathbb E W_n = \frac1\Lambda\iint_S m(x,y)\,\lambda(x,y)\,dx\,dy = \frac{I_1}{\Lambda}\quad\text{and}\quad \mathbb E W_n^2 = \frac1\Lambda\iint_S\big[m(x,y)^2+v(x,y)\big]\,\lambda(x,y)\,dx\,dy = \frac{I_2}{\Lambda}.$$
Consequently,
$$\mathbb E\big(W\,|\,N\big) = \sum_{n=1}^{N}\Lambda^{-1}I_1 = N\Lambda^{-1}I_1$$
and
$${\rm Var}\big(W\,|\,N\big) = \sum_{n=1}^{N}\big(\Lambda^{-1}I_2-\Lambda^{-2}I_1^2\big) = N\big(\Lambda^{-1}I_2-\Lambda^{-2}I_1^2\big).$$
Then
$$\mathbb E W = \mathbb E N\,\Lambda^{-1}I_1 = I_1$$
and, using $\mathbb E N = {\rm Var}\,N = \Lambda$,
$${\rm Var}\,W = \mathbb E\big[{\rm Var}(W\,|\,N)\big] + {\rm Var}\big(\mathbb E(W\,|\,N)\big) = \big(I_2-\Lambda^{-1}I_1^2\big)+\Lambda^{-1}I_1^2 = I_2.$$
If $\Lambda$ is infinite, partition $S$ into disjoint regions $S_k$ with $\iint_{S_k}\lambda\,dx\,dy<\infty$ and let $W(k)$ be the (independent) total weight of the plants in $S_k$; then
$$\mathbb E W = \sum_k\mathbb E W(k) = \sum_k\iint_{S_k}m\,\lambda\,dx\,dy = \iint_S m\,\lambda\,dx\,dy = I_1,$$
and similarly for the variance.
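A short simulation illustrates the identities $\mathbb E W=I_1$ and ${\rm Var}\,W=I_2$. The sketch below takes $S=[0,1]^2$ with a constant density and normally distributed weights (possibly negative, which does not affect the moment identities); all these specific choices and the helper names are ours, made only for illustration.

```python
import math
import random

def poisson(mu, rng):
    k, p, c, u = 0, math.exp(-mu), math.exp(-mu), rng.random()
    while u > c:
        k += 1
        p *= mu / k
        c += p
    return k

def total_weight(lam, m, v, rng):
    """One realisation of W: a Poisson(lam) number of plants uniform on the unit
    square, each with an independent N(m(x,y), v(x,y)) weight."""
    n = poisson(lam, rng)
    w = 0.0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        w += rng.gauss(m(x, y), math.sqrt(v(x, y)))
    return w

rng = random.Random(0)
lam = 10.0                                   # constant density on S = [0,1]^2
m = lambda x, y: 1.0 + x                     # mean weight
v = lambda x, y: 0.25                        # weight variance
samples = [total_weight(lam, m, v, rng) for _ in range(40000)]
mean = sum(samples) / len(samples)
var = sum((w - mean) ** 2 for w in samples) / len(samples)
# I1 = lam * integral of (1+x) = 15;  I2 = lam * integral of ((1+x)^2 + 0.25)
I1, I2 = 15.0, 10.0 * (7.0 / 3.0 + 0.25)
print(f"E[W]   ~ {mean:.3f}   I1 = {I1:.3f}")
print(f"Var[W] ~ {var:.3f}   I2 = {I2:.3f}")
```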
Problem 4.7 A Poisson process $\Lambda$ of lines in the plane is specified by the measure of a set $B$ of lines,
$$\mu(B) = \lambda\iint_B dp\,d\theta, \qquad(4.6.4)$$
where a line is parametrised by the length $p$ of, and the angle $\theta$ made by, the perpendicular from the origin to the line. Let $\Sigma$ be the set of intersection points of pairs of lines of $\Lambda$. Is $\Sigma$ a Poisson process?
Solution Suppose that $\mu$ is a measure on the space $L$ of lines in $\mathbb R^2$ not passing through $0$. A Poisson process $\Pi$ with mean measure $\mu$ is a random countable subset of $L$ such that
(1) the number $N(A)$ of points of $\Pi$ in a measurable subset $A$ of $L$ has distribution ${\rm Po}(\mu(A))$, and
(2) for disjoint $A_1,\dots,A_n$, the $N(A_j)$ are independent.
In the problem, the number $N$ of lines which meet the disc $D$ of centre $0$ and radius $r$ equals the number of lines with $p<r$. It is Poisson with mean
$$\int_0^{r}\!\!\int_0^{2\pi}\lambda\,dp\,d\theta = 2\pi\lambda r.$$
If there is at least one point of $\Sigma$ in $D$ then there must be at least two lines of $\Lambda$ meeting $D$, and this has probability
$$\sum_{n\ge 2}\frac{(2\pi\lambda r)^n}{n!}e^{-2\pi\lambda r} = 1-(1+2\pi\lambda r)e^{-2\pi\lambda r}.$$
The probability of a point of $\Sigma$ lying in $D$ is strictly less than this, because there may be two lines meeting $D$ whose intersection lies outside $D$.
Finally, $\Sigma$ is not a Poisson process, since, with positive probability, it has collinear points.
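The strict inequality in the last remark can be seen numerically. The sketch below samples the lines meeting the disc $D$ directly (their number is ${\rm Po}(2\pi\lambda r)$, with $p$ and $\theta$ uniform given that), finds pairwise intersections, and compares the frequency of an intersection inside $D$ with the bound $1-(1+2\pi\lambda r)e^{-2\pi\lambda r}$; the numerical parameters are arbitrary.

```python
import math
import random

def poisson(mu, rng):
    x, p, c, u = 0, math.exp(-mu), math.exp(-mu), rng.random()
    while u > c:
        x += 1
        p *= mu / x
        c += p
    return x

def lines_hitting_disc(lam, r, rng):
    """Lines of the process meeting D: N ~ Po(2*pi*lam*r); each has p ~ U(0, r),
    theta ~ U(0, 2*pi), and equation x*cos(theta) + y*sin(theta) = p."""
    n = poisson(2 * math.pi * lam * r, rng)
    return [(rng.uniform(0, r), rng.uniform(0, 2 * math.pi)) for _ in range(n)]

def has_intersection_in_disc(lines, r):
    for i in range(len(lines)):
        p1, t1 = lines[i]
        for j in range(i + 1, len(lines)):
            p2, t2 = lines[j]
            det = math.cos(t1) * math.sin(t2) - math.sin(t1) * math.cos(t2)
            if abs(det) < 1e-12:
                continue                       # (almost) parallel lines
            x = (p1 * math.sin(t2) - p2 * math.sin(t1)) / det
            y = (math.cos(t1) * p2 - math.cos(t2) * p1) / det
            if x * x + y * y < r * r:
                return True
    return False

rng = random.Random(0)
lam, r, runs = 0.3, 1.0, 20000
hit = sum(has_intersection_in_disc(lines_hitting_disc(lam, r, rng), r) for _ in range(runs)) / runs
bound = 1 - (1 + 2 * math.pi * lam * r) * math.exp(-2 * math.pi * lam * r)
print(f"P(some point of Sigma in D) ~ {hit:.3f}  <  P(at least two lines meet D) = {bound:.3f}")
```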
Problem 4.8 Particular cases of the Poisson–Dirichlet distribution for the random sequence $(p_1,p_2,p_3,\dots)$ with parameter $\theta$ appeared in PSE II; the definition is given below. Show that, for any polynomial $\phi$ with $\phi(0)=0$,
$$\mathbb E\Big[\sum_{n\ge 1}\phi(p_n)\Big] = \theta\int_0^1\phi(x)\,x^{-1}(1-x)^{\theta-1}\,dx. \qquad(4.6.5)$$
Here the sequence $(p_1,p_2,\dots)$ is constructed as follows. Let $\sigma_1\ge\sigma_2\ge\dots$ be the points of a Poisson process on $(0,\infty)$ with intensity measure $\theta x^{-1}e^{-x}\,dx$ (where $\theta>0$ can be arbitrary). Then the sum $\sigma=\sum_{n\ge 1}\sigma_n$ is finite with probability $1$, has the Gamma distribution ${\rm Gam}(\theta)$ and is independent from the vector $p=(p_1,p_2,\dots)$ of the ratios $p_n=\sigma_n/\sigma$, with
$$p_1\ge p_2\ge\dots,\qquad \sum_{n\ge 1}p_n = 1,\quad\text{with probability } 1.$$
Here Gam stands for the Gamma distribution; see PSE I, Appendix.
To prove (4.6.5), we can take $p_n=\sigma_n/\sigma$ and use the fact that $\sigma$ and $p$ are independent. For $k\ge 1$,
$$\mathbb E\sum_{n\ge 1}\sigma_n^k = \theta\int_0^\infty x^k\,x^{-1}e^{-x}\,dx = \theta\,\Gamma(k).$$
On the other hand, since $\sigma_n^k=\sigma^k p_n^k$ and $\sigma\sim{\rm Gam}(\theta)$ is independent of $p$,
$$\mathbb E\sum_{n\ge 1}\sigma_n^k = \mathbb E\,\sigma^k\ \mathbb E\sum_{n\ge 1}p_n^k = \frac{\Gamma(\theta+k)}{\Gamma(\theta)}\,\mathbb E\sum_{n\ge 1}p_n^k.$$
Thus,
$$\mathbb E\sum_{n\ge 1}p_n^k = \frac{\theta\,\Gamma(k)\Gamma(\theta)}{\Gamma(k+\theta)} = \theta\int_0^1 x^{k-1}(1-x)^{\theta-1}\,dx.$$
We see that the identity (4.6.5) holds for $\phi(x)=x^k$ (with $k\ge 1$) and hence, by linearity, for all polynomials $\phi$ with $\phi(0)=0$.
492
0 b
a
x1 (1 x) 1 dx.
If a > 1/2, there can be at most one such pn , so that p1 has the PDF
x1 (1 x) 1 on (1/2, 1).
But this fails on (0, 1/2), and the identity (4.6.5) does not determine the distribution
of p1 on this interval.
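The moment identity $\mathbb E\sum_n p_n^k=\theta\,{\rm B}(k,\theta)$ can be checked by simulation. The sketch below generates the atoms via the stick-breaking (GEM$(\theta)$) representation, whose atoms are a size-biased reordering of the Poisson–Dirichlet atoms, so sums over all atoms are unchanged; the truncation tolerance and parameter values are our choices.

```python
import math
import random

def pd_power_sum(theta, k, rng, tol=1e-12):
    """One sample of sum_n p_n^k for PD(theta), generated through stick breaking:
    successive proportions are Beta(1, theta).  For k >= 1 the truncated tail
    contributes at most the remaining mass, so the tolerance controls the error."""
    total, remaining = 0.0, 1.0
    while remaining > tol:
        w = rng.betavariate(1.0, theta)        # stick-breaking proportion
        p = remaining * w
        total += p ** k
        remaining -= p
    return total

theta, k, runs = 2.0, 3, 20000
rng = random.Random(0)
emp = sum(pd_power_sum(theta, k, rng) for _ in range(runs)) / runs
exact = theta * math.gamma(k) * math.gamma(theta) / math.gamma(k + theta)   # theta*B(k, theta)
print(f"E[sum p_n^{k}] ~ {emp:.4f}   theta*B(k, theta) = {exact:.4f}")
```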
Problem 4.9 The positions of trees in a large forest can be modelled as a Poisson process $\Pi$ of constant rate $\lambda$ on $\mathbb R^2$. Each tree produces a random number of seeds having a Poisson distribution with mean $\mu$. Each seed falls to earth at a point uniformly distributed over the circle of radius $r$ whose centre is the tree. The positions of the different seeds relative to their parent tree, and the numbers of seeds produced by a given tree, are independent of each other and of $\Pi$. Prove that, conditional on $\Pi$, the seeds form a Poisson process $\Pi'$ whose mean measure depends on $\Pi$. Is the unconditional distribution of $\Pi'$ that of a Poisson process?
Solution By a direct calculation, the seeds from a tree at $X$ form a Poisson process with rate
$$\lambda_X(x) = \begin{cases}\mu\big/(\pi r^2), & |x-X|<r,\\ 0, & \text{otherwise.}\end{cases}$$
Superposing these independent Poisson processes gives a Poisson process with rate
$$\Lambda(x) = \sum_{X\in\Pi}\lambda_X(x);$$
it clearly depends on $\Pi$. The unrealistic assumption of a circular uniform distribution is chosen to create no doubt about this dependence: in this case $\Pi$ can be reconstructed from the contours of $\Lambda$.
Here we meet for the first time the doubly stochastic (Cox) processes, i.e. Poisson processes with random intensity. The number of seeds $N(A)$ in a bounded set $A$ has mean
$$\mathbb E N(A) = \mathbb E\int_A\Lambda(x)\,dx$$
and variance
$${\rm Var}\,N(A) = \mathbb E\big[{\rm Var}\big(N(A)\,|\,\Pi\big)\big]+{\rm Var}\big(\mathbb E\big[N(A)\,|\,\Pi\big]\big) = \mathbb E N(A)+{\rm Var}\int_A\Lambda(x)\,dx > \mathbb E N(A).$$
Hence, $\Pi'$ is not a Poisson process.
Problem 4.10 A uniform Poisson process $\Pi$ in the unit ball of $\mathbb R^3$ is one whose mean measure is Lebesgue measure (volume) on
$$B = \{(x,y,z)\in\mathbb R^3:\ r^2 = x^2+y^2+z^2\le 1\}.$$
Show that
$$\Pi_1 = \{r:\ (x,y,z)\in\Pi\}$$
is a Poisson process on $[0,1]$ and find its mean measure. Show that
$$\Pi_2 = \{(x/r,\,y/r,\,z/r):\ (x,y,z)\in\Pi\}$$
is a Poisson process on the unit sphere and find its mean measure. Are $\Pi_1$ and $\Pi_2$ independent?
Solution By the mapping theorem, $\Pi_1$ and $\Pi_2$ are Poisson processes. The expected number of points of $\Pi_1$ in an interval $(a,b)\subseteq[0,1]$ equals the volume of the spherical shell $\{a<r<b\}$, i.e.
$$\frac{4\pi}{3}\big(b^3-a^3\big).$$
Thus, the mean measure of $\Pi_1$ has the PDF $4\pi r^2$ $(0<r<1)$. Similarly, the expected number of points of $\Pi_2$ in $A\subseteq\partial B$ equals
$$\text{(the conic volume from $0$ to $A$)} = \frac13\times\text{(the surface area of $A$)}.$$
Finally, $\Pi_1$ and $\Pi_2$ are not independent since they have the same number of points.
Problem 4.11 The points of $\Pi$ are coloured randomly either red or green, the probability of any point being red being $r$, $0<r<1$, and the colours of different points being independent. Show that the red and the green points form independent Poisson processes.
Solution Let $\mu$ be the mean measure of $\Pi$, and let $N_1(A)$, $N_2(A)$ be the numbers of red and green points in a set $A$ with $\mu(A)<\infty$. Conditional on $N(A)=N_1(A)+N_2(A)=n$, the number of red points is binomial ${\rm Bin}(n,r)$, so that, for $j,k\ge 0$,
$$P\big(N_1(A)=j,\,N_2(A)=k\big) = e^{-\mu(A)}\frac{\mu(A)^{j+k}}{(j+k)!}\binom{j+k}{j}r^j(1-r)^k = e^{-r\mu(A)}\frac{(r\mu(A))^j}{j!}\;e^{-(1-r)\mu(A)}\frac{((1-r)\mu(A))^k}{k!}.$$
Hence, $N_1(A)$ and $N_2(A)$ are independent Poisson random variables with means $r\mu(A)$ and $(1-r)\mu(A)$, respectively.
If A1 , A2 , . . . are disjoint sets then the pairs
(N1 (A1 ), N2 (A1 )), (N1 (A2 ), N2 (A2 )), . . .
are independent, and hence
N1 (A1 ), N1 (A2 ), . . . and N2 (A1 ), N2 (A2 ), . . .
are two independent sequences of independent random variables. If $\mu(A)=\infty$ then $N(A)=\infty$ a.s., and since $r>0$ and $1-r>0$, there are a.s. infinitely many red and green points in $A$.
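A minimal simulation of the colouring theorem for a single set $A$: with $N(A)\sim{\rm Po}(\mu(A))$ points coloured red independently with probability $r$, the red and green counts should have means $r\mu(A)$, $(1-r)\mu(A)$ and vanishing covariance. The parameter values below are arbitrary.

```python
import math
import random

def poisson(mu, rng):
    k, p, c, u = 0, math.exp(-mu), math.exp(-mu), rng.random()
    while u > c:
        k += 1
        p *= mu / k
        c += p
    return k

def colour_counts(mu, r, rng):
    """One realisation: N ~ Po(mu) points, each independently red with probability r."""
    n = poisson(mu, rng)
    red = sum(1 for _ in range(n) if rng.random() < r)
    return red, n - red

rng = random.Random(0)
mu, r, runs = 4.0, 0.3, 50000
reds = greens = cross = 0.0
for _ in range(runs):
    n1, n2 = colour_counts(mu, r, rng)
    reds += n1; greens += n2; cross += n1 * n2
m1, m2 = reds / runs, greens / runs
print(f"mean red   = {m1:.3f}   (r*mu = {r*mu:.3f})")
print(f"mean green = {m2:.3f}   ((1-r)*mu = {(1-r)*mu:.3f})")
print(f"cov(red, green) = {cross/runs - m1*m2:.4f}   (close to 0, as independence predicts)")
```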
Problem 4.12 A model of a rainstorm falling on a level surface (taken to be the plane $\mathbb R^2$) describes each raindrop by a triple $(X,T,V)$, where $X\in\mathbb R^2$ is the horizontal position of the centre of the drop, $T$ is the instant at which the drop hits the plane, and $V$ is the volume of water in the drop. The points $(X,T,V)$ are assumed to form a Poisson process on $\mathbb R^4$ with a given rate $\lambda(x,t,v)$. The drop forms a wet circular patch on the surface, with centre $X$ and a radius that increases with time, the radius at time $T+t$ being a given function $r(t,V)$. Find the probability that a point $\xi\in\mathbb R^2$ is dry at time $\tau$, and show that the total rainfall in the storm has expectation
$$\int_{\mathbb R^4}v\,\lambda(x,t,v)\,dx\,dt\,dv.$$
Solution The point $\xi$ is wet at time $\tau$ if and only if some point $(X,T,V)$ of the process satisfies $T<\tau$ and $\|X-\xi\|<r(\tau-T,V)$; the number of such points is Poisson with mean
$$\Lambda = \int\lambda(x,t,v)\,\mathbf 1\big(t<\tau,\ \|x-\xi\|<r(\tau-t,v)\big)\,dx\,dt\,dv.$$
Hence, the probability that $\xi$ is dry is $e^{-\Lambda}$ (or $0$ if $\Lambda=+\infty$). Finally, the formula for the expected total rainfall,
$$\mathbb E\sum_{(X,T,V)}V = \int_{\mathbb R^4}v\,\lambda(x,t,v)\,dx\,dt\,dv,$$
follows from Campbell's theorem (cf. (4.6.3)).
where
$$d_{jk} = \sum_{1\le t\le n}\frac{(a_{jt}-a_{kt})^2}{v_t}. \qquad(4.6.6)$$
Suppose that $M=2$ and that the transmitted waveforms are subject to the power constraint $\sum_{1\le t\le n}a_{jt}^2\le K$, $j=1,2$. Which choice of the two waveforms minimises the probability of error?
[Hint: You may assume validity of the bound $P(Z\ge a)\le\exp(-a^2/2)$, where $Z$ is a standard ${\rm N}(0,1)$ random variable.]
Solution Let $f_j = f_{\rm ch}(y\,|\,X=A_j)$ be the PDF of receiving a vector $y$ given that a waveform $A_j=(a_{jt})$ was transmitted. Then
$$P({\rm error})\le\frac1M\sum_j\sum_{k:\,k\ne j}P\big(\{y:\ f_k(y)\ge f_j(y)\}\,\big|\,X=A_j\big).$$
Let $V$ be the diagonal matrix with the diagonal elements $v_t$. In the present case,
$$f_j = C\exp\Big(-\frac12\sum_{t=1}^{n}(y_t-a_{jt})^2/v_t\Big) = C\exp\Big(-\frac12(Y-A_j)^{\rm T}V^{-1}(Y-A_j)\Big).$$
Then if $X=A_j$ and $Y=A_j+\varepsilon$ we have
$$\log f_k-\log f_j = -\frac12(A_j-A_k+\varepsilon)^{\rm T}V^{-1}(A_j-A_k+\varepsilon)+\frac12\varepsilon^{\rm T}V^{-1}\varepsilon = -\frac12 d_{jk}-(A_j-A_k)^{\rm T}V^{-1}\varepsilon = -\frac12 d_{jk}+\sqrt{d_{jk}}\,Z,$$
where $Z\sim{\rm N}(0,1)$.
Hence, by the hint,
$$P\big(f_k\ge f_j\,\big|\,X=A_j\big) = P\big(Z\ge\tfrac12\sqrt{d_{jk}}\big)\le e^{-d_{jk}/8},$$
so for $M=2$ we minimise the bound on the error probability by maximising
$$d_{12} = \sum_{1\le t\le n}\frac{(a_{1t}-a_{2t})^2}{v_t} = (A_1-A_2)^{\rm T}V^{-1}(A_1-A_2)$$
subject to
$$\sum_{1\le t\le n}a_{jt}^2\le K,\quad\text{or}\quad A_j^{\rm T}A_j\le K,\ j=1,2.$$
By Cauchy--Schwarz,
$$(A_1-A_2)^{\rm T}V^{-1}(A_1-A_2)\le 2\big(A_1^{\rm T}V^{-1}A_1+A_2^{\rm T}V^{-1}A_2\big), \qquad(4.6.7)$$
with equality holding when $A_1=-A_2$. Further, in our case $V$ is diagonal, and the right-hand side of (4.6.7), subject to the power constraint, is maximised when $A_j^{\rm T}A_j=K$, $j=1,2$, with all the power placed on the coordinates where the noise variance is smallest. We conclude that
$$a_{1t} = -a_{2t} = b_t,$$
with $b_t$ non-zero only for $t$ such that $v_t$ is minimal, and $\sum_t b_t^2 = K$.
Problem 4.15 A random variable $Y$ is distributed on the non-negative integers. Show that the maximum entropy of $Y$, subject to $\mathbb E Y\le M$, is
$$(M+1)\log(M+1)-M\log M.$$
A memoryless channel adds to its input $X\in\{0,1,2,\dots\}$ an independent noise $Z$ with $P(Z=1)=p$, $P(Z=0)=q=1-p$, where $p\le 1/3$, so that the output is $Y=X+Z$; the input is constrained by $\mathbb E X\le q$. Show that the entropy-maximising output distribution is $P(Y=r)=2^{-(r+1)}$, $r\ge 0$, and determine the capacity of the channel.
Describe, very briefly, the problem of determining the channel capacity if $p>1/3$.
Solution For the first part, we
$$\text{maximise } h(Y) = -\sum_{y\ge 0}p_y\log p_y\ \text{ subject to }\ p_y\ge 0,\ \sum_y p_y=1,\ \sum_y y\,p_y = M.$$
The Lagrangian method (or Gibbs' inequality) gives the geometric maximiser
$$p_y = (1-\lambda)\lambda^y,\quad y\ge 0,\qquad\text{with}\quad \lambda = \frac{M}{M+1},\ \ 1-\lambda = \frac1{M+1},$$
whose entropy equals $(M+1)\log(M+1)-M\log M$, as required; since this value increases with $M$, the same distribution is also optimal under the weaker constraint $\mathbb E Y\le M$.
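As a numerical illustration, the sketch below evaluates the entropy of the mean-$M$ geometric distribution and compares it with the claimed maximum and with the entropy of another mean-$M$ law on the non-negative integers (Poisson, computed here purely for comparison).

```python
import math

def geometric_entropy(M):
    """Entropy (nats) of p_y = (1/(M+1)) * (M/(M+1))**y, the mean-M geometric law."""
    lam = M / (M + 1.0)
    return -math.log(1.0 - lam) - M * math.log(lam)

def poisson_entropy(M, terms=500):
    """Entropy (nats) of Po(M), another mean-M law, for comparison."""
    h, p = 0.0, math.exp(-M)
    for y in range(terms):
        if p > 0.0:
            h -= p * math.log(p)
        p *= M / (y + 1)
    return h

for M in (0.5, 1.0, 3.0):
    target = (M + 1) * math.log(M + 1) - M * math.log(M)
    print(f"M={M}: geometric {geometric_entropy(M):.6f}, "
          f"claimed max {target:.6f}, Poisson {poisson_entropy(M):.6f}")
```

For every $M$ the geometric value coincides with $(M+1)\log(M+1)-M\log M$ and dominates the Poisson entropy, as the maximisation predicts.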
For the channel, write the output probability generating function as
$$\mathbb E z^Y = \mathbb E z^X\,\mathbb E z^Z = \mathbb E z^X(q+pz).$$
The entropy-maximising output is geometric with mean $1$, i.e. $\mathbb E z^Y = 1/(2-z)$, so the required input PGF is
$$\mathbb E z^X = \frac1{(2-z)(q+pz)} = \frac1{1+p}\Big[(2-z)^{-1}+p\,(q+pz)^{-1}\Big] = \frac1{1+p}\sum_{r\ge 0}\Big[\big(1/2\big)^{1+r}+\big(p/q\big)\big(-p/q\big)^{r}\Big]z^r.$$
For $p\le 1/3$ these coefficients are non-negative, so the optimum is attainable; the output $Y$ is then geometric with mean $1$, and the capacity equals
$$h(Y)-h(Z) = 2\log 2+p\log p+q\log q.$$
If $p>1/3$ then $p/q>1/2$ and the alternate probabilities become negative, which means that there is no distribution for $X$ giving an optimum for $Y$. Then we would have to maximise
$$-\sum_y p_y\log p_y,\quad\text{subject to } p_y = p\,\pi_{y-1}+q\,\pi_y,$$
where $\pi_y\ge 0$, $\sum_y\pi_y = 1$ and $\sum_y y\,\pi_y\le q$.
Problem 4.16 Assuming the bounds on channel capacity asserted by the second coding theorem, deduce the capacity of a memoryless Gaussian channel.
A channel consists of $r$ independent memoryless Gaussian channels, the noise in the $i$th channel having variance $v_i$, $i=1,2,\dots,r$. The compound channel is subject to an overall power constraint $\mathbb E\sum_i x_{it}^2\le p$, for each $t$, where $x_{it}$ is the input of the $i$th channel at time $t$. Determine the capacity of this compound channel.
Solution By the second coding theorem, the capacity of the compound channel is
$$C = \max\Big\{\sum_{1\le i\le r}\frac12\log\Big(1+\frac{p_i}{v_i}\Big):\ p_i\ge 0,\ \sum_i p_i\le p\Big\}.$$
The Lagrangian $L = \sum_i\frac12\log\big(1+p_i/v_i\big)-\lambda\sum_i p_i$ has
$$\frac{\partial L}{\partial p_i} = \frac12\big(v_i+p_i\big)^{-1}-\lambda,\qquad i=1,\dots,r,$$
and the maximum is attained at
$$p_i = \max\Big[0,\ \frac1{2\lambda}-v_i\Big] = \Big(\frac1{2\lambda}-v_i\Big)_+,$$
where $\lambda>0$ is chosen so that
$$\sum_i\Big(\frac1{2\lambda}-v_i\Big)_+ = p.$$
The existence and uniqueness of $\lambda$ follows since the LHS monotonically decreases from $+\infty$ to $0$. Thus,
$$C = \frac12\sum_i\Big(\log\frac1{2\lambda v_i}\Big)_+.$$
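The optimisation above is the classical water-filling allocation, and the level $1/(2\lambda)$ can be found by a direct search rather than by solving for $\lambda$ explicitly. The sketch below (our own helper, not from the text) computes the allocation and the resulting capacity in nats for a few power budgets.

```python
import math

def water_filling(v, p):
    """Water-filling over channels with noise variances v and total power p.
    Returns (allocation p_i = (w - v_i)_+ with channels sorted by noise,
    capacity in nats), where the level w = 1/(2*lambda) solves sum (w - v_i)_+ = p."""
    v = sorted(v)
    for j in range(len(v), 0, -1):           # try filling the j least noisy channels
        w = (p + sum(v[:j])) / j
        if w > v[j - 1]:                      # consistent: all j of them are active
            break
    alloc = [max(0.0, w - vi) for vi in v]
    cap = 0.5 * sum(math.log(w / vi) for vi in v if vi < w)
    return alloc, cap

v = [1.0, 2.0, 4.0]
for p in (0.5, 3.0, 10.0):
    alloc, cap = water_filling(v, p)
    print(f"p = {p}: allocation = {[round(a, 3) for a in alloc]}, capacity = {cap:.4f} nats")
```

For small $p$ all power goes to the least noisy channel; as $p$ grows the water level rises above the larger $v_i$ and the remaining channels are switched on, exactly as $p_i=(1/(2\lambda)-v_i)_+$ prescribes.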
Problem 4.17 Here we consider random variables taking values in a given set $A$ (finite, countable or uncountable) whose distributions are determined by PMFs with respect to a given reference measure $\nu$. Let $\phi$ be a real function and $\alpha$ a real number. Prove that the maximum $h_{\max}(X)$ of the entropy $h(X) = -\int f_X(x)\log f_X(x)\,\nu(dx)$ subject to the constraint $\mathbb E\,\phi(X)=\alpha$ is achieved at the random variable $X$ with the PMF
$$f_X(x) = \frac1\Xi\exp\big(-\beta\phi(x)\big), \qquad(4.6.8{\rm a})$$
where $\Xi = \Xi(\beta) = \int\exp\big(-\beta\phi(x)\big)\,\nu(dx)$ is the normalising constant and $\beta$ is chosen so that
$$\frac1\Xi\int\phi(x)\exp\big(-\beta\phi(x)\big)\,\nu(dx) = \alpha. \qquad(4.6.8{\rm b})$$
Assume that a value $\beta$ with the property (4.6.8b) exists.
Show that if, in addition, the function $\phi$ is non-negative, then, for any given $\alpha>0$, the PMF $f_X$ from (4.6.8a), (4.6.8b) maximises the entropy $h(X)$ under the wider constraint $\mathbb E\,\phi(X)\le\alpha$.
Consequently, calculate the maximal value of $h(X)$ subject to $\mathbb E\,\phi(X)\le\alpha$, in the following cases: (i) when $A$ is a finite set, $\nu$ is a positive measure on $A$ (with $\nu_i = \nu(\{i\})$ and $\nu(A) = \sum_{j\in A}\nu_j$) and $\phi(x)\equiv 1$, $x\in A$; (ii) when $A$ is
Bibliography
[18] R.E. Blahut. Principles and Practice of Information Theory. Reading, MA:
Addison-Wesley, 1987.
[19] R.E. Blahut. Theory and Practice of Error Control Codes. Reading, MA: Addison-Wesley, 1983. See also Algebraic Codes for Data Transmission. Cambridge:
Cambridge University Press, 2003.
[20] R.E. Blahut. Algebraic Codes on Lines, Planes, and Curves. Cambridge:
Cambridge University Press, 2008.
[21] I.F. Blake, R.C. Mullin. The Mathematical Theory of Coding. New York: Academic
Press, 1975.
[22] I.F. Blake, R.C. Mullin. An Introduction to Algebraic and Combinatorial Coding
Theory. New York: Academic Press, 1976.
[23] I.F. Blake (ed). Algebraic Coding Theory: History and Development. Stroudsburg,
PA: Dowden, Hutchinson & Ross, 1973.
[24] N. Blachman. Noise and its Effect on Communication. New York: McGraw-Hill,
1966.
[25] R.C. Bose, D.K. Ray-Chaudhuri. On a class of error correcting binary group codes. Information and Control, 3(1), 68–79, 1960.
[26] W. Bradley, Y.M. Suhov. The entropy of famous reals: some empirical results.
Random and Computational Dynamics, 5, 349359, 1997.
[27] A.A. Bruen, M.A. Forcinito. Cryptography, Information Theory, and ErrorCorrection: A Handbook for the 21st Century. Hoboken, NJ: Wiley-Interscience,
2005.
[28] J.A. Buchmann. Introduction to Cryptography. New York: Springer-Verlag, 2002.
[29] P.J. Cameron, J.H. van Lint. Designs, Graphs, Codes and their Links. Cambridge:
Cambridge University Press, 1991.
[30] J. Castineira Moreira, P.G. Farrell. Essentials of Error-Control Coding. Chichester:
Wiley, 2006.
[31] W.G. Chambers. Basics of Communications and Coding. Oxford: Clarendon, 1985.
[32] G.J. Chaitin. The Limits of Mathematics: A Course on Information Theory and the
Limits of Formal Reasoning. Singapore: Springer, 1998.
[33] G. Chaitin. Information-Theoretic Incompleteness. Singapore: World Scientific,
1992.
[34] G. Chaitin. Algorithmic Information Theory. Cambridge: Cambridge University
Press, 1987.
[35] F. Conway, J. Siegelman. Dark Hero of the Information Age: In Search of Norbert
Wiener, the Father of Cybernetics. New York: Basic Books, 2005.
[36] T.M. Cover, J.M. Thomas. Elements of Information Theory. New York: Wiley,
2006.
[37] I. Csiszar, J. Korner. Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic Press, 1981; Budapest: Akademiai Kiado,
1981.
[38] W.B. Davenport, W.L. Root. Random Signals and Noise. New York: McGraw Hill,
1958.
[39] A. Dembo, T. M. Cover, J. A. Thomas. Information theoretic inequalities. IEEE
Transactions on Information Theory, 37, (6), 15011518, 1991.
[40] R.L. Dobrushin. Taking the limit of the argument of entropy and information functions. Teoriya Veroyatn. Primen., 5, (1), 2937, 1960; English translation: Theory
of Probability and its Applications, 5, 2532, 1960.
[41] F. Dyson. The Tragic Tale of a Genius. New York Review of Books, July 14, 2005.
[42] W. Ebeling. Lattices and Codes: A Course Partially Based on Lectures by F. Hirzebruch. Braunschweig/Wiesbaden: Vieweg, 1994.
[43] N. Elkies. Excellent codes from modular curves. STOC01: Proceedings of the
33rd Annual Symposium on Theory of Computing (Hersonissos, Crete, Greece),
pp. 200208, NY: ACM, 2001.
[44] S. Engelberg. Random Signals and Noise: A Mathematical Introduction. Boca Raton, FL: CRC/Taylor & Francis, 2007.
[45] R.M. Fano. Transmission of Information: A Statistical Theory of Communication.
New York: Wiley, 1961.
[46] A. Feinstein. Foundations of Information Theory. New York: McGraw-Hill, 1958.
[47] G.D. Forney. Concatenated Codes. Cambridge, MA: MIT Press, 1966.
[48] M. Franceschetti, R. Meester. Random Networks for Communication. From Statistical Physics to Information Science. Cambridge: Cambridge University Press,
2007.
[49] R. Gallager. Information Theory and Reliable Communications. New York: Wiley,
1968.
[50] A. Gofman, M. Kelbert. An upper bound for Kullback–Leibler divergence with a small number of outliers. Mathematical Communications, 18, (1), 75–78, 2013.
[51] S. Goldman. Information Theory. Englewood Cliffs, NJ: Prentice-Hall, 1953.
[52] C.M. Goldie, R.G.E. Pinch. Communication Theory. Cambridge: Cambridge
University Press, 1991.
[53] O. Goldreich. Foundations of Cryptography, Vols 1, 2. Cambridge: Cambridge
University Press, 2001, 2004.
[54] V.D. Goppa. Geometry and Codes. Dordrecht: Kluwer, 1988.
[55] S. Gravano. Introduction to Error Control Codes. Oxford: Oxford University Press,
2001.
[56] R.M. Gray. Source Coding Theory. Boston: Kluwer, 1990.
[57] R.M. Gray. Entropy and Information Theory. New York: Springer-Verlag, 1990.
[58] R.M. Gray, L.D. Davisson (eds). Ergodic and Information Theory. Stroudsburg,
CA: Dowden, Hutchinson & Ross, 1977 .
[59] V. Guruswami, M. Sudan. Improved decoding of ReedSolomon codes and algebraic geometry codes. IEEE Trans. Inform. Theory, 45, (6), 17571767, 1999.
[60] R.W. Hamming. Coding and Information Theory. 2nd ed. Englewood Cliffs, NJ:
Prentice-Hall, 1986.
[61] T.S. Han. Information-Spectrum Methods in Information Theory. New York:
Springer-Verlag, 2002.
[62] D.R. Hankerson, G.A. Harris, P.D. Johnson, Jr. Introduction to Information Theory
and Data Compression. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC, 2003.
[63] D.R. Hankerson et al. Coding Theory and Cryptography: The Essentials. 2nd ed.
New York: M. Dekker, 2000. (Earlier version: D. G. Hoffman et al. Coding Theory:
The Essentials. New York: M. Dekker, 1991.)
[64] W.E. Hartnett. Foundations of Coding Theory. Dordrecht: Reidel, 1974.
[65] S.J. Heims. John von Neumann and Norbert Wiener: From Mathematics to the
Technologies of Life and Death. Cambridge, MA: MIT Press, 1980.
[66] C. Helstrom. Statistical Theory of Signal Detection. 2nd ed. Oxford: Pergamon
Press, 1968.
[67] C.W. Helstrom. Elements of Signal Detection and Estimation. Englewood Cliffs,
NJ: Prentice-Hall, 1995.
[68] R. Hill. A First Course in Coding Theory. Oxford: Oxford University Press, 1986.
[69] T. Ho, D.S. Lun. Network Coding: An Introduction. Cambridge: Cambridge University Press, 2008.
[70] A. Hocquenghem. Codes correcteurs d'erreurs. Chiffres, 2, 147–156, 1959.
[71] W.C. Huffman, V. Pless. Fundamentals of Error-Correcting Codes. Cambridge:
Cambridge University Press, 2003.
[72] J.F. Humphreys, M.Y. Prest. Numbers, Groups, and Codes. 2nd ed. Cambridge:
Cambridge University Press, 2004.
[73] S. Ihara. Information Theory for Continuous Systems. Singapore: World Scientific,
1993 .
[74] F.M. Ingels. Information and Coding Theory. Scranton: Intext Educational Publishers, 1971.
[75] I.M. James. Remarkable Mathematicians. From Euler to von Neumann.
Cambridge: Cambridge University Press, 2009 .
[76] E.T. Jaynes. Papers on Probability, Statistics and Statistical Physics. Dordrecht:
Reidel, 1982.
[77] F. Jelinek. Probabilistic Information Theory. New York: McGraw-Hill, 1968.
[78] G.A. Jones, J.M. Jones. Information and Coding Theory. London: Springer, 2000.
[79] D.S. Jones. Elementary Information Theory. Oxford: Clarendon Press, 1979.
[80] O. Johnson. Information Theory and the Central Limit Theorem. London: Imperial
College Press, 2004.
[81] J. Justensen. A class of constructive asymptotically good algebraic codes. IEEE
Transactions Information Theory, 18(5), 652656, 1972.
[82] M. Kelbert, Y. Suhov. Continuity of mutual entropy in the large signal-to-noise
ratio limit. In Stochastic Analysis 2010, pp. 281299, 2010. Berlin: Springer.
[83] N. Khalatnikov. Dau, Centaurus and Others. Moscow: Fizmatlit, 2007.
[84] A.Y. Khintchin. Mathematical Foundations of Information Theory. New York:
Dover, 1957.
[85] T. Klove. Codes for Error Detection. Singapore: World Scientific, 2007.
[86] N. Koblitz. A Course in Number Theory and Cryptography. New York: Springer,
1993 .
[87] H. Krishna. Computational Complexity of Bilinear Forms: Algebraic Coding Theory and Applications of Digital Communication Systems. Lecture notes in control
and information sciences, Vol. 94. Berlin: Springer-Verlag, 1987.
[88] S. Kullback. Information Theory and Statistics. New York: Wiley, 1959.
[89] S. Kullback, J.C. Keegel, J.H. Kullback. Topics in Statistical Information Theory.
Berlin: Springer, 1987.
[90] H.J. Landau, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, II. Bell System Technical Journal, 6484, 1961.
[91] H.J. Landau, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, III. The dimension of the space of essentially time- and band-limited
signals. Bell System Technical Journal, 12951336, 1962.
[92] R. Lidl, H. Niederreiter. Finite Fields. Cambridge: Cambridge University Press,
1997.
[93] R. Lidl, G. Pilz. Applied Abstract Algebra. 2nd ed. New York: Wiley, 1999.
[94] E.H. Lieb. Proof of entropy conjecture of Wehrl. Commun. Math. Phys., 62, (1),
3541, 1978.
[95] S. Lin. An Introduction to Error-Correcting Codes. Englewood Cliffs, NJ; London:
Prentice-Hall, 1970.
[96] S. Lin, D.J. Costello. Error Control Coding: Fundamentals and Applications.
Englewood Cliffs, NJ: Prentice-Hall, 1983.
[97] S. Ling, C. Xing. Coding Theory. Cambridge: Cambridge University Press, 2004.
[98] J.H. van Lint. Introduction to Coding Theory. 3rd ed. Berlin: Springer, 1999.
[99] J.H. van Lint, G. van der Geer. Introduction to Coding Theory and Algebraic
Geometry. Basel: Birkhauser, 1988.
[100] J.C.A. van der Lubbe. Information Theory. Cambridge: Cambridge University
Press, 1997.
[101] R.E. Lewand. Cryptological Mathematics. Washington, DC: Mathematical Association of America, 2000.
[102] J.A. Llewellyn. Information and Coding. Bromley: Chartwell-Bratt; Lund:
Studentlitteratur, 1987.
[103] M. Loève. Probability Theory. Princeton, NJ: van Nostrand, 1955.
[104] D.G. Luenberger. Information Science. Princeton, NJ: Princeton University Press,
2006.
[105] D.J.C. Mackay. Information Theory, Inference and Learning Algorithms.
Cambridge: Cambridge University Press, 2003.
[106] H.B. Mann (ed). Error-Correcting Codes. New York: Wiley, 1969 .
[107] M. Marcus. Dark Hero of the Information Age: In Search of Norbert Wiener, the
Father of Cybernetics. Notices of the AMS 53, (5), 574579, 2005.
[108] A. Marshall, I. Olkin. Inequalities: Theory of Majorization and its Applications.
New York: Academic Press, 1979 .
[109] V.P. Maslov, A.S. Chernyi. On the minimization and maximization of entropy in
various disciplines. Theory Probab. Appl. 48, (3), 447464, 2004.
[110] F.J. MacWilliams, N.J.A. Sloane. The Theory of Error-Correcting Codes, Vols I,
II. Amsterdam: North-Holland, 1977.
[111] R.J. McEliece. The Theory of Information and Coding. Reading, MA: Addison-Wesley, 1977. 2nd ed. Cambridge: Cambridge University Press, 2002.
[112] R. McEliece. The Theory of Information and Coding. Student ed. Cambridge:
Cambridge University Press, 2004.
[113] A. Menon, R.M. Buecher, J.H. Read. Impact of exclusion region and spreading in
spectrum-sharing ad hoc networks. ACM 1-59593-510-X/06/08, 2006 .
[114] R.A. Mollin. RSA and Public-Key Cryptography. New York: Chapman & Hall,
2003.
[115] R.H. Morelos-Zaragoza. The Art of Error-Correcting Coding. 2nd ed. Chichester:
Wiley, 2006.
[116] G.L. Mullen, C. Mummert. Finite Fields and Applications. Providence, RI:
American Mathematical Society, 2007.
[117] A. Myasnikov, V. Shpilrain, A. Ushakov. Group-Based Cryptography. Basel:
Birkhauser, 2008.
[118] G. Nebe, E.M. Rains, N.J.A. Sloane. Self-Dual Codes and Invariant Theory. New
York: Springer, 2006.
[119] H. Niederreiter, C. Xing. Rational Points on Curves over Finite Fields: Theory and
Applications. Cambridge: Cambridge University Press, 2001.
[120] W.W. Peterson, E.J. Weldon. Error-Correcting Codes. 2nd ed. Cambridge,
MA: MIT Press, 1972. (Previous ed. W.W. Peterson. Error-Correcting Codes.
Cambridge, MA: MIT Press, 1961.)
[121] M.S. Pinsker. Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, 1964.
[122] V. Pless. Introduction to the Theory of Error-Correcting Codes. 2nd ed. New York:
Wiley, 1989.
[123] V.S. Pless, W.C. Huffman (eds). Handbook of Coding Theory, Vols 1, 2. Amsterdam: Elsevier, 1998.
[124] P. Piret. Convolutional Codes: An Algebraic Approach. Cambridge, MA: MIT
Press, 1988.
[125] O. Pretzel. Error-Correcting Codes and Finite Fields. Oxford: Clarendon Press,
1992; Student ed. 1996.
[126] T.R.N. Rao. Error Coding for Arithmetic Processors. New York: Academic Press,
1974.
[127] M. Reed, B. Simon. Methods of Modern Mathematical Physics, Vol. II. Fourier
analysis, self-adjointness. New York: Academic Press, 1975.
[128] A. Renyi. A Diary on Information Theory. Chichester: Wiley, 1987; initially published Budapest: Akademiai Kiado, 1984.
[129] F.M. Reza. An Introduction to Information Theory. New York: Constable, 1994.
[130] S. Roman. Coding and Information Theory. New York: Springer, 1992.
[131] S. Roman. Field Theory. 2nd ed. New York: Springer, 2006.
[132] T. Richardson, R. Urbanke. Modern Coding Theory. Cambridge: Cambridge University Press, 2008.
[133] R.M. Roth. Introduction to Coding Theory. Cambridge: Cambridge University
Press, 2006.
[134] B. Ryabko, A. Fionov. Basics of Contemporary Cryptography for IT Practitioners.
Singapore: World Scientific, 2005.
[135] W.E. Ryan, S. Lin. Channel Codes: Classical and Modern. Cambridge: Cambridge
University Press, 2009.
[136] T. Schurmann, P. Grassberger. Entropy estimation of symbol sequences. Chaos, 6,
(3), 414427, 1996.
[137] P. Seibt. Algorithmic Information Theory: Mathematics of Digital Information Processing. Berlin: Springer, 2006.
[138] C.E. Shannon. A mathematical theory of cryptography. Bell Lab. Tech. Memo.,
1945.
[139] C.E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27, July, October, 379423, 623658, 1948.
[140] C.E. Shannon: Collected Papers. N.J.A. Sloane, A.D. Wyner (eds). New York:
IEEE Press, 1993.
[141] C.E. Shannon, W. Weaver. The Mathematical Theory of Communication. Urbana,
IL: University of Illinois Press, 1949.
[142] P.C. Shields. The Ergodic Theory of Discrete Sample Paths. Providence, RI:
American Mathematical Society, 1996.
[143] M.S. Shrikhande, S.S. Sane. Quasi-Symmetric Designs. Cambridge: Cambridge
University Press, 1991.
[144] S. Simic. Best possible global bounds for Jensen functionals. Proc. AMS, 138, (7),
24572462, 2010.
[145] A. Sinkov. Elementary Cryptanalysis: A Mathematical Approach. 2nd ed. revised
and updated by T. Feil. Washington, DC: Mathematical Association of America,
2009.
[146] D. Slepian, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, Vol. I. Bell System Technical Journal, 4364, 1961 .
[147] W. Stallings. Cryptography and Network Security: Principles and Practice. 5th ed.
Boston, MA: Prentice Hall; London: Pearson Education, 2011.
[148] H. Stichtenoth. Algebraic Function Fields and Codes. Berlin: Springer, 1993.
[149] D.R. Stinson. Cryptography: Theory and Practice. 2nd ed. Boca Raton, FL;
London: Chapman & Hall/CRC, 2002.
[150] D. Stoyan, W.S. Kendall. J. Mecke. Stochastic Geometry and its Applications.
Berlin: Academie-Verlag, 1987 .
[151] C. Schlegel, L. Perez. Trellis and Turbo Coding. New York: Wiley, 2004.
[152] S. Sujan. Ergodic Theory, Entropy and Coding Problems of Information Theory. Praha: Academia, 1983.
[153] P. Sweeney. Error Control Coding: An Introduction. New York: Prentice Hall,
1991.
[154] Te Sun Han, K. Kobayashi. Mathematics of Information and Coding. Providence,
RI: American Mathematical Society, 2002.
[155] T.M. Thompson. From Error-Correcting Codes through Sphere Packings to Simple
Groups. Washington, DC: Mathematical Association of America, 1983.
[156] R. Togneri, C.J.S. deSilva. Fundamentals of Information Theory and Coding
Design. Boca Raton, FL: Chapman & Hall/CRC, 2002.
[157] W. Trappe, L.C. Washington. Introduction to Cryptography: With Coding Theory.
2nd ed. Upper Saddle River, NJ: Pearson Prentice Hall, 2006.
[158] M.A. Tsfasman, S.G. Vladut. Algebraic-Geometric Codes. Dordrecht: Kluwer
Academic, 1991.
[159] M. Tsfasman, S. Vladut, T. Zink. Modular curves, Shimura curves and Goppa
codes, better than Varshamov–Gilbert bound. Mathematische Nachrichten, 109, 21–28, 1982.
[160] M. Tsfasman, S. Vladut, D. Nogin. Algebraic Geometric Codes: Basic Notions.
Providence, RI: American Mathematical Society, 2007.
[161] M.J. Usher. Information Theory for Information Technologists. London: Macmillan, 1984.
[162] M.J. Usher, C.G. Guy. Information and Communication for Engineers. Basingstoke: Macmillan, 1997
[163] I. Vajda. Theory of Statistical Inference and Information. Dordrecht: Kluwer, 1989.
[164] S. Verdu. Multiuser Detection. New York: Cambridge University Press, 1998.
[165] S. Verdu, D. Guo. A simple proof of the entropy-power inequality. IEEE Trans. Inform. Theory, 52, (5), 2165–2166, 2006.
[166] L.R. Vermani. Elements of Algebraic Coding Theory. London: Chapman & Hall,
1996.
[167] B. Vucetic, J. Yuan. Turbo Codes: Principles and Applications. Norwell, MA:
Kluwer, 2000.
[168] G. Wade. Coding Techniques: An Introduction to Compression and Error Control.
Basingstoke: Palgrave, 2000.
[169] J.L. Walker. Codes and Curves. Providence, RI: American Mathematical Society,
2000.
[170] D. Welsh. Codes and Cryptography. Oxford, Oxford University Press, 1988.
[171] N. Wiener. Cybernetics or Control and Communication in Animal and Machine.
Cambridge, MA: MIT Press, 1948; 2nd ed: 1961, 1962.
[172] J. Wolfowitz. Coding Theorems of Information Theory. Berlin: Springer, 1961; 3rd
ed: 1978.
[173] A.D. Wyner. The capacity of the band-limited Gaussian channel. Bell System Technical Journal, 359–395, 1966.
[174] A.D. Wyner. The capacity of the product of channels. Information and Control,
423433, 1966.
[175] C. Xing. Nonlinear codes from algebraic curves beating the Tsfasman–Vladut–Zink bound. IEEE Transactions on Information Theory, 49, 1653–1657, 2003.
[176] A.M. Yaglom, I.M. Yaglom. Probability and Information. Dordrecht, Holland:
Reidel, 1983.
[177] R. Yeung. A First Course in Information Theory. Boston: Kluwer Academic, 1992;
2nd ed. New York: Kluwer, 2002.
Index
bound
BCH bound, 237, 295
Elias bound, 177
Gilbert bound, 198
Gilbert–Varshamov bound, 154
Griesmer bound, 197
Hamming bound, 150
Johnson bound, 177
linear programming bound, 322
Plotkin bound, 155
Singleton bound, 154
bar-product, 152
capacity, 61
capacity of a discrete channel, 61
capacity of a memoryless Gaussian channel with
white noise, 374
capacity of a memoryless Gaussian channel with
coloured noise, 375
operational channel capacity, 102
character (as a digit or a letter or a symbol), 53
character (of a homomorphism), 313
modular character, 314
trivial, or principal, character, 313
character transform, 319
characteristic of a field, 269
channel, 60
additive Gaussian channel (AGC), 368
memoryless Gaussian channel (MGC), 368
memoryless additive Gaussian channel (MAGC),
366
memoryless binary channel (MBC), 60
memoryless binary symmetric channel (MBSC), 60
noiseless channel, 103
channel capacity, 61
operational channel capacity, 102
check matrix: see parity-check matrix
cipher (or a cryptosystem), 463
additive stream cipher, 463
cipher (or a cryptosystem) (cont.)
one-time pad cipher, 466
public-key cipher, 467
ciphertext, 468
code, or encoding, viii, 4
alternant code, 332
BCH code, 213
binary code, 10, 95
cardinality of a code, 253
cyclic code, 216
decipherable code, 14
dimension of a linear code, 149
D error detecting code, 147
dual code, 153
equivalent codes, 190
E error correcting code, 147
Golay code, 151
Goppa code, 160, 334
Hamming code, 199
Huffman code, 9
information rate of a code, 147
Justesen code, 240, 332
lossless code, 4
linear code, 148
maximal distance separating (MDS), 155
parity-check code, 149
perfect code, 151
prefix-free code, 4
random code, 68, 372
rank of a linear code, 184
Reed–Muller (RM) code, 203
Reed–Solomon code, 256, 291
repetition code, 149
reversible cyclic code, 230
self-dual code, 201, 227
self-orthogonal, 227
simplex code, 194
codebook, 67
random codebook, 68
coder, or encoder, 3
codeword, 4
random codeword, 6
coding: see encoding
coloured noise, 374
concave, 19, 32
strictly concave, 32
concavity, 20
conditionally independent, 26
conjugacy, 281
conjugate, 229
convergence almost surely (a.s.), 131
convergence in probability, 43
convex, 32
strictly convex, 104
convexity, 142
core polynomial of a field, 231
coset, 192
cyclotomic coset, 285
leader of a coset, 192
cryptosystem (or a cipher), 468
bit commitment cryptosystem, 468
ElGamal cryptosystem, 475
public key cryptosystem, 468
RSA (Rivest–Shamir–Adleman) cryptosystem, 468
Rabin, or RabinWilliams cryptosystem, 473
cyclic group, 231
generator of a cyclic group
cyclic shift, 216
data-processing inequality, 80
detailed balance equations (DBEs), 56
decoder, or a decoding rule, 65
geometric (or minimal distance) decoding rule, 163
ideal observer (IO) decoding rule, 66
maximum likelihood (ML) decoding rule, 66
joint typicality (JT) decoder, 372
decoding, 167
decoding alternant codes, 337
decoding BCH codes, 239, 310
decoding cyclic codes, 214
decoding Hamming codes, 200
list decoding, 192, 405
decoding Reed–Muller codes, 209
decoding Reed–Solomon codes, 292
decoding Reed–Solomon codes by the Guruswami–Sudan algorithm, 299
syndrome decoding, 193
decrypt function, 469
degree of a polynomial, 206, 214
density of a probability distribution (PDF), 86
differential entropy, 86
digit, 3
dimension, 149
dimension of a code, 149
dimension of a linear representation, 314
discrete Fourier transform (FFT), 296
discrete-time Markov chain (DTMC), 1, 3
discrete logarithm, 474
distributed system, or a network (of transmitters), 436
Dirac δ-function, 318
distance, 20
Kullback–Leibler distance, 20
Hamming distance, 144
minimal distance of a code, 147
distance enumerator polynomial, 322
divisor, 217
greatest common divisor (gcd), 223
dot-product, 153
doubly stochastic (Cox) random process, 492
electronic signature, 469, 476
encoding, or coding, vii, 4
Huffman encoding, 9
ShannonFano encoding, 9
random coding, 67
entropy, vii, 7
axiomatic definition of entropy, 36
binary entropy, 7
conditional entropy, 20
differential entropy, 86
entropy of a random variable, 18
entropy of a probability distribution, 18
joint entropy, 20
mutual entropy, 28
entropy–power inequality, 92
q-ary entropy, 7
entropy rate, vii, 41
relative entropy, 20
encrypt function, 468
ergodic random process (stationary), 397
ergodic transformation of a probability space, 397
error locator, 311
error locator polynomial, 239, 311
error-probability, 58
extension of a code, 151
parity-check extension, 151
extension field, 261
factor (as a divisor), 39
irreducible factor, 219
prime factor, 39
factorization, 230
fading of a signal, 447
power fading, 447
Rayleigh fading, 447
feedback shift register, 453
linear feedback shift register (LFSR), 454
feedback polynomial, 454
field (a commutative ring with inverses), 146, 230
extension field, 261
Galois field, 272
finite field, 194
polynomial field, 231
primitive element of a field, 230, 232
splitting field, 236, 271
Frobenius map, 283
Gaussian channel, 366
additive Gaussian channel (AGC), 368
memoryless Gaussian channel (MGC), 368
memoryless additive Gaussian channel (MAGC),
366
Gaussian coloured noise, 374
Gaussian white noise, 368
Gaussian random process, 369
generating matrix, 185
equiprobable, or uniform, distribution, 3, 22
exponential distribution (with exponential density),
89
geometric distribution, 21
joint probability, 1
multivariate normal distribution, 88
normal distribution (with univariate normal
density), 89
Poisson distribution, 101
probability mass function (PMF), 366
probability space, 397
prolate spheroidal wave function (PSWF), 425
protocol of a private communication, 469
Diffie–Hellman protocol, 474
prefix, 4
prefix-free code, 4
product-channel, 404
public-key cipher, 467
quantum mechanics, 431
random code, 68, 372
random codebook, 68
random codeword, 6
random measure, 436
Poisson random measure (PRM), 436
random process, vii
Gaussian random process, 369
Poisson random process, 436
stationary random process, 397
stationary ergodic random process, 397
random variable, 18
conditionally independent random variables, 26
equiprobable, or uniform, random variable,
3, 22
exponential random variable (with exponential
density), 89
geometric random variable, 21
independent identically distributed (IID) random
variables, 1, 3
joint probability distribution of random variables, 1
normal random variable (with univariate normal
density), 89
Poisson random variable, 101
random vector, 20
multivariate normal random vector, 88
rank of a code, 184
rank-nullity property, 186
rate, 15
entropy rate, vii, 41
information rate of a source, 15
reliable encoding (or encodable) rate, 15
reliable transmission rate, 62
reliable transmission rate with regional constraint,
373
regional constraint for channel capacity, 367
register, 453
feedback shift register, 453
linear feedback shift register (LFSR), 454
feedback, or auxiliary, polynomial of an LFSR, 454
initial fill of register, 454
output stream of a register, 454
repetition code, 149
repetition of a code, 152
ring, 217
ideal of a ring, 217
quotient ring, 274
root of a cyclic code, 230
defining root of a cyclic code, 233
root of a polynomial, 228
root of unity, 228
primitive root of unity, 236
sample, viii, 2
signal/noise ratio (SNR), 449
sinc function, 413
size of a code, 147
space, 35
Hamming space, 144
space 2 (R1 ), 415
linear space, 146
linear subspace, 148
space of a linear representation, 314
state space of a Markov chain, 35
vector space over a field, 269
stream, 463
strictly concave, 32
strictly convex, 104
string, or a word (of characters, digits, letters or
symbols), 3
source of information (random), 2, 44
Bernoulli source, 3
equiprobable Bernoulli source, 3
Markov source, 3
stationary Markov source, 3
spectral density, 417
stationary, 3
stationary Markov source, 3
stationary random process, 397
stationary ergodic random process, 397
supercritical network, 449
symbol, 2
syndrome, 192
theorem
Brunn–Minkowski theorem, 93
Campbell theorem, 442
Cayley–Hamilton theorem, 456
central limit theorem (CLT), 94
Doob–Lévy theorem, 409
local De Moivre–Laplace theorem, 53
mapping theorem, 437
theorem (cont.)
product theorem, 444
Shannon theorem, 8
Shannons noiseless coding theorem (NLCT), 8
Shannons first coding theorem (FCT), 42
Shannons second coding theorem (SCT), or noisy
coding theorem (NCT), 59, 162
Shannons SCT: converse part, 69
Shannons SCT: strong converse part, 175
Shannons SCT: direct part, 71, 163
ShannonMcMillanBreiman theorem, 397
totient function, 270
transform
character transform, 319
Fourier transform, 296