The Science of Information: From Language to Black Holes
Dr. Benjamin Schumacher is Professor of Physics at Kenyon College, where he has taught for 25 years. He received his B.A. from Hendrix College and his Ph.D. in Theoretical Physics from The University of Texas at Austin, where he was the last doctoral student of John Archibald Wheeler.
For The Great Courses, Dr. Schumacher also has taught Black Holes,
Tides, and Curved Spacetime: Understanding Gravity; Quantum
Mechanics: The Physics of the Microscopic World; and Impossible:
Physics beyond the Edge.
INTRODUCTION
LECTURE GUIDES
LECTURE 1
The Transformability of Information .............................................................................4
LECTURE 2
Computation and Logic Gates ......................................................................................17
LECTURE 3
Measuring Information .................................................................................................. 26
LECTURE 4
Entropy and the Average Surprise ............................................................................ 34
LECTURE 5
Data Compression and Prefix-Free Codes ............................................................. 44
LECTURE 6
Encoding Images and Sounds .....................................................................................57
LECTURE 7
Noise and Channel Capacity ...................................................................................... 69
LECTURE 8
Error-Correcting Codes ................................................................................................ 82
LECTURE 9
Signals and Bandwidth
LECTURE 10
Cryptography and Key Entropy ..................................................................................110
LECTURE 11
Cryptanalysis and Unraveling the Enigma ..............................................................119
LECTURE 12
Unbreakable Codes and Public Keys ......................................................................130
LECTURE 13
What Genetic Information Can Do ...........................................................................140
LECTURE 14
Life's Origins and DNA Computing .......................................................................... 152
LECTURE 15
Neural Codes in the Brain ...........................................................................................169
LECTURE 16
Entropy and Microstate Information.........................................................................185
LECTURE 17
Erasure Cost and Reversible Computing ...............................................................198
LECTURE 18
Horse Races and Stock Markets ...............................................................................213
LECTURE 19
Turing Machines and Algorithmic Information .....................................................226
LECTURE 20
Uncomputable Functions and Incompleteness...................................................239
LECTURE 21
Qubits and Quantum Information ............................................................................253
LECTURE 22
Quantum Cryptography via Entanglement ...........................................................266
LECTURE 23
It from Bit: Physics from Information ........................................................................281
LECTURE 24
The Meaning of Information ......................................................................................293
SUPPLEMENTAL MATERIAL
The Science of Information
From Language to Black Holes
We live in the age of information. But what is information, and what are its laws? Claude Shannon's 1948 creation of information theory launched a revolution in both science and technology. The key property of information is its amazing transformability. Any single message can take on dozens of different physical forms, from marks on paper to radio waves, and any type of information can be encoded as a series of binary digits, or bits. Combined with revolutionary advances in electronics, Shannon's ideas changed the world.

Modern information theory has gone beyond Shannon in two ways. The first, algorithmic information theory, is based on computation rather than communication. The algorithmic entropy of a binary sequence is the length of the shortest computer program producing that sequence. This concept of information has remarkable connections to the unsolvable mathematical problems of Kurt Gödel and the uncomputable functions of Alan Turing.
Quantum information, the information carried by microscopic particles,
has concepts and rules quite different from Shannon's theory. It is
measured in qubits. Such information cannot be cloned, but it allows
new and more powerful types of computing. Quantum entanglement,
an exclusive information relationship between qubits, can be used as
a basis for perfectly secure quantum cryptography. The information
aspects of quantum physics, and other information-related insights from
black hole physics and cosmology, have inspired such physicists as John
Wheeler to speculate on an information basis for all of physical reality: it
from bit.
Lecture 1: The Transformability of Information

We live in a revolutionary age: the age of information. Never before in history have we been able to acquire, record, communicate, and use so many different forms of information. Never before have we had access to such vast quantities of data of every kind. Never before have we been able to translate information from one form to another so easily and quickly. And, of course, never before have we had such a need to comprehend the concepts and principles of information.

Defining Information

In the same way, the idea of information is not about the value or significance of a message. Meaning, value, and significance are obviously important qualities, but they are not the keys to understanding what information is.

Shannon pointed out that bits form a kind of universal currency for information. Our basic alphabet needs only two distinct symbols. And because we can transform information, freely switching from one code to another, any sort of information, such as numbers, words, or pictures, can be represented by bits: arrangements of binary digits, 0s and 1s.
Why does the teletype code use 5 bits? The answer involves a
fundamental fact of information theory.
A code represents different messages by a series or string of symbols, in this case the five binary digits. This is the codeword that represents the message. But the code must preserve the information, that is, the distinction between messages, so that no two different messages can be represented by the same codeword. That means that the number of possible codewords in the code can be no smaller than the number of possible messages (designated M). In other words: (# of possible codewords) ≥ M. That's our fundamental fact about codes.

The fact that the number of codewords (32) is greater than the number of possible messages (27) means that the 5-bit system works. If we tried to get by with only 4 bits, we would have only 2^4 = 16 available codewords, which wouldn't be enough.
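To make the counting concrete, here is a minimal Python sketch (the function name and example values are mine, not the course's) that computes the smallest number of bits whose codewords can cover M distinct messages:

```python
import math

def min_bits(num_messages: int) -> int:
    """Smallest n such that 2**n >= num_messages."""
    return math.ceil(math.log2(num_messages))

# 27 teletype messages need 5 bits (2**5 = 32 codewords >= 27),
# while 4 bits would give only 2**4 = 16 codewords.
print(min_bits(27))   # 5
print(min_bits(128))  # 7 (the size of the ASCII character set)
```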
The first binary code was the Braille alphabet.

These days, most text information is stored in computers using a 7-bit code called the American Standard Code for Information Interchange (ASCII). There are 2^7 = 128 ASCII codewords, more than enough to represent any character on a keyboard. Because the bits in a modern computer are generally grouped into bytes of 8 bits, one 7-bit ASCII codeword is usually assigned 1 byte of computer memory.

One of the first people to understand this fact was the Hungarian-American mathematician John von Neumann. In the years following World War II, von Neumann was thinking about computers and robots, both the technical issues of how to build them and the basic principles on which they operate. One question von Neumann considered was this: Would it be possible to build a self-reproducing machine?
Imagine a robot that lives in a warehouse full of machine parts. The
robot grabs components from the shelves and puts them together,
eventually assembling an exact duplicate of itself. Von Neumann
wondered how such a machine would work.
Von Neumann figured out that his machine had to work in a different way, a way that relies on the two-sided nature of information.
Von Neumann machines already exist in nature; they are called living
things. All kinds of biological organisms produce offspring of the same
basic design as themselves. Thus, von Neumann's insight tells us something significant about how information works in living systems.
We now know that the genetic information in an organism is
contained in its DNA; the sequence of chemical bases in a DNA
molecule is a kind of blueprint for the organism.
Flip-Flop Circuit
Since the mid-20th century, the transistor has become the basis for more
and more amazing technologies, especially those for communication
and information processing. Those technologies have become almost
unimaginably powerful and complex.
One way to chart this progress is the famous Moore's law, according to which the number of transistors on one integrated circuit has approximately doubled every two years. That rate of growth has continued, without interruption, for the last 50 years.
READINGS
QUESTIONS
Lecture 2: Computation and Logic Gates

In Shannon's information theory, the essence of information is the distinction between different possible messages. The simplest possible distinction, the atom of information, is the bit: the binary distinction between 1 and 0, yes and no, or on and off. As Shannon pointed out, every sort of information (numbers, text, sound, images, and video) can be communicated by means of bits. In thinking about the power of bits, Shannon was following up an idea that he himself had introduced a decade earlier in his master's thesis at MIT. In it, he laid the basic groundwork for an amazing technological advance: the design of a computer out of electrical circuits, rather than mechanical parts.
In 1821, an English mathematician and scientist named Charles Babbage, unhappy with the system of tables, proposed a completely mechanical calculation machine, a device that could do calculations with perfect accuracy and print results without mistakes. Babbage's machine was called the difference engine.
Difference Engine
Analytical Engine
Babbage's dream remained just that during his lifetime and for long
afterward. His most important ideas, such as the analytical engine,
were largely forgotten until they were reinvented in different form a
century later. But the difference engine, though it was never completed,
inspired generations of mechanical calculating machines, including the
differential analyzer.
The name differential analyzer actually refers to several machines built during the late 19th and early 20th centuries, all of them designed to solve differential equations, the governing equations in many branches of physics and engineering.
Boolean Algebra
A logic gate stands for both a Boolean operation and a piece of electrical circuitry. Shannon's idea was that these are really the same things. One or more inputs (1-bit messages) arrive from one side, and one or more 1-bit outputs emerge from the other side.
The next logic gate is the AND operation. Given two variables, a
and b, then a AND b is true only when both a and b are true.
We can get a sense of how this works by designing a circuit that can add
a pair of 1-bit numbers together (a half-adder circuit) and one that can
add three 1-bit numbers together (a full-adder circuit).
These same ideas, very far extended, are the basis for the design of
modern computer processors.
A single processing chip might contain hundreds of millions of
logic gates, each one consisting of a few microscopic transistors,
connected by wires only a few hundred atoms wide, exchanging
1-bit messages billions of times per second.
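Here is a minimal Python sketch of the half-adder and full-adder circuits described above, built only from the basic gate operations (the function names are mine):

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two 1-bit numbers; return (sum, carry)."""
    return a ^ b, a & b          # XOR gives the sum bit, AND gives the carry

def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three 1-bit numbers; return (sum, carry_out)."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 | c2           # a carry from either half adder propagates

# 1 + 1 + 1 = 11 in binary: sum bit 1, carry bit 1
print(full_adder(1, 1, 1))       # (1, 1)
```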
TERMS
Boolean logic: The algebraic logic devised by George Boole in the 19th century, used as the basis for the design of fully electronic computers in Claude Shannon's 1937 master's thesis. Shannon showed that any mathematical calculation could be reduced to a complex network of Boolean operations and that these could be implemented via electrical circuits.
full adder: A combination of Boolean logic gates that computes the sum of
three 1-bit inputs and produces the answer as a 2-bit binary number. Because
the full adder can include the results of a carry operation, a cascade of
these devices can accomplish the addition of two binary numbers of any specified length.
XOR: The exclusive OR operation in Boolean logic. When both input bits
agree (00 or 11), then the output is 0; when they disagree (01 or 10), the
output is 1.
READINGS
QUESTIONS
2 Some Boolean logic gates can be created from others. Show how an
OR gate can be built out of a combination of AND and NOT gates.
Lecture 3: Measuring Information

We have seen how any mathematical calculation can be performed by a complex network of 1-bit messages and simple logic gates. With this idea, Claude Shannon laid the conceptual foundations for modern electronic computers. In this lecture, we will begin to dig into information theory itself, as Shannon created it a decade later. We begin with the questions: How do we measure information, and how do we tell whether a message involves a small amount or a large amount of information?
But our intuition says that two messages should have just two
times as much information as one message, not the square of the
information.
Understanding Logarithms
Understanding Entropy
Why do we call this the entropy? From the outset, Shannon knew
that there was an analogy between his measure of information and
an important concept in thermodynamics. In that field, entropy helps
determine whether a particular energy transformation is possible or not,
and it can be calculated by taking the logarithm of a certain number.
The entropy of a password, the base-2 logarithm of the number of possible passwords, is a good measure of the password's security.
What does entropy mean? We can interpret H in two ways: (1) the amount
of information we gain when we receive a message, that is, a measure
of the content of the message, or (2) the amount of information we lack
before we receive a message, that is, a measure of our uncertainty
about the message.
Entropy also tells us something important about the codes we can use
to communicate messages. Again, suppose that our source generates M
different possible messages.
To take a concrete example, let M = 5; the message is one of the
letters A through E. We want to create a code that represents the
output of this source as a codeword made up of binary digits. How
many bits do we need?
Using 2.5 bits per message is still greater than the fundamental entropy limit, which is 2.32 bits per message, but we have gotten closer to that limit. How close can we get? Let's try coding triplets of messages.

There are 5^3 = 125 possible triplets, from AAA up to EEE. The entropy of a triplet is three times the single-message entropy, or 6.96 bits. That means that a code of 7 bits could accurately represent the triplets.
And using 7 bits for three messages means 7/3 = 2.333 bits per
message, which is quite close to the entropy limit of 2.32 bits per
message.
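The arithmetic above can be checked with a short Python sketch (the variable names are mine):

```python
import math

M = 5                                  # messages A through E, equally likely
entropy = math.log2(M)                 # about 2.32 bits per message

for block in (1, 2, 3):
    codewords_needed = M ** block      # e.g., 5**3 = 125 possible triplets
    bits = math.ceil(math.log2(codewords_needed))
    print(block, bits, bits / block)   # bits per block, bits per message
# 1 3 3.0
# 2 5 2.5
# 3 7 2.333...
```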
Shannon's Conclusions
TERM
READINGS
QUESTIONS
2 In the state of Ohio, new license plates typically have three letters
followed by four numerical digits: XYZ 1234. How many bits of car-identification information are found on a license plate?
4 You are playing 20 questions, and you know that the unknown target
is a word from the Oxford English Dictionary. Suggest a strategy for asking questions to find the word using as few questions as possible.
Lecture 4: Entropy and the Average Surprise

Our everyday intuition says that we measure information by looking at the length of a message: 5 bits, 1 kilobit, and so on. But Shannon's information theory starts with something more fundamental: How surprising is the message? How informative? A long message might not be very informative, while a short message might be quite informative. The question then becomes: How do we measure the surprise of a message?

The first rule of probability is that the probability p(x) is always between 0 and 1: For any x, 0 ≤ p(x) ≤ 1. If x is an impossible event, then p(x) = 0. If x is absolutely certain to occur, then p(x) = 1. In other situations, p(x) is somewhere in between: closer to 0 if x is less likely and closer to 1 if x is more likely.
In the long run, our claim is that probability should be the same thing
as statistical frequency. The coin comes up heads half the time.
The fact that the probability of no fire is close to 1 and the probability of fire is close to 0 tells us that we need a measure of surprise to measure the information content of a message.
But that conclusion feels wrong. To us, it seems as if we learn more when
the alarm goes off than we do when the alarm stays silent.
At any given moment, it is overwhelmingly more likely that there is no fire; p(no fire) is close to 1, and p(fire) is close to 0. Thus, we more or less expect the alarm to be silent.

s(x) = log2(1/p(x)).

If p(x) gets very close to zero, 1/p(x) becomes very large, and the surprise s(x) gets large, as well.

For a source with M equally likely messages, p(x) = 1/M, and the surprise is

s(x) = log2(1/(1/M)) = log2(M).

A model that takes into account the different probabilities for the different letters is a 1st-order letter-by-letter model. But the output
of a machine that uses this model is only a little better because the
letters in English text are not independent of each other. Each letter
tells us something about the letters around it.
Zipf's law works fairly well for The Return of Sherlock Holmes. For example, the word crime is number 218 on the list and occurs 60 times. Twice as far down the list, at number 436, is the word master, which occurs 28 times, about half as often.

The entropy is the average surprise over all possible messages:

H(X) = Σ_x p(x) log2(1/p(x)).
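As an illustration of the entropy formula, here is a small Python sketch (the example distributions are mine):

```python
import math

def entropy(probs):
    """Average surprise: the sum of p * log2(1/p) over all messages."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# A fair coin carries 1 bit per toss; a heavily biased coin carries far less.
print(entropy([0.5, 0.5]))    # 1.0
print(entropy([0.99, 0.01]))  # about 0.08 bits
```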
Does that mean that we can use fewer binary digits in our
codewords? Could we represent English with a code that uses only
2 bits per letter? Is there still a connection between entropy and coding? We'll explore these questions in the next lecture.
TERMS
READINGS
QUESTIONS
1 Look at the chart of the letter frequencies of English below. How many of the least-frequent letters would you have to combine to have about the same total probability as the most frequent letter, E? What is the total probability of the 12 most common letters (ETAOIN SHRDLU), according to the printer's rule?
Lecture 5: Data Compression and Prefix-Free Codes

After 50 years of Moore's law, with its exponential improvement in computer and information technology, it has become absurdly inexpensive to store information. In fact, the cost of data storage is so cheap that it's measured in a new unit of money, the nanocent. One bit of data storage costs half a nanocent. Despite this affordability, it is still important to find codes that pack as much information into as few bits as possible, in part to enable us to transfer information from place to place. For this, we turn to the topic of data compression.
Information Inequality
s(x) = log2(1/p(x)).
Let's return to a simple example to illustrate this idea. Our source has
three possible messages: {a,b,c}. Message a has probability 1/2, while b
and c each have probability 1/4. Thus, the surprise of a is 1 bit, and the
surprises of b and c are each 2 bits. The average surprise, the entropy,
is 1.5 bits.
Suppose that you know the actual probabilities, p(x), for messages
from the information source X. But another message recipient has
some other, possibly mistaken probabilities in mind, designated q(x).
This other recipient believes that all three messages are equally
likely, so that q( x) = 1/3. Thus, each message for this recipient has a
surprise of log2(3) = 1.585 bits.
You and the other recipient start receiving messages from the source. Sometimes, the other recipient is more surprised than you are. When a occurs, your surprise is 1 bit, and the other recipient's is 1.585 bits. Sometimes, the other recipient is less surprised. When b or c occurs, your surprise is 2 bits, but the recipient's is still 1.585 bits. On average, the other recipient is more surprised than you are.

Σ_x p(x) log2(1/q(x)) ≥ H(X).
The recipient with the wrong probabilities is always more surprised,
on average. This mathematical fact shows up all through information
theory, so much so that it is called the information inequality.
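A quick numerical check of the information inequality for this example (a minimal sketch; the helper name is mine):

```python
import math

def avg_surprise(p, q):
    """Average surprise felt by someone who believes q when messages really follow p."""
    return sum(p[x] * math.log2(1 / q[x]) for x in p)

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # the true probabilities
q = {'a': 1/3, 'b': 1/3, 'c': 1/3}     # the other recipient's mistaken guess

print(avg_surprise(p, p))  # 1.5 bits: the entropy H(X)
print(avg_surprise(p, q))  # about 1.585 bits: always at least H(X)
```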
Samuel F. B. Morse
Morse was not the first person to dream of sending messages by electricity, but he succeeded where others failed because he kept it simple. A single pair of wires connected the sender and receiver. The signal was simply the on/off state of the input key. But how could a meaningful message be conveyed by on-off-on-off?

Samuel Morse's basic ideas for sending messages remain enshrined in the International Morse Code. The more common the message, the shorter we should make the corresponding codeword.
Morse's idea was to represent each letter by a series of electrical pulses.
These pulses came in two varieties, a short one (dot) and a longer one
(dash). In terms of electrical signals, the dot was on-off and the dash
was on-on-on-off. There were longer pauses between letters and
between words.
Morse Code
Efficiency in Messages
Let's consider a code built out of bits in which there are two possible
codewords of length 1 (0 and 1); four possible codewords of length 2
(00, 01, 10, and 11); eight possible codewords of length 3; and so on.
The first bits are 001, but there's no way to tell whether that means aab (0-0-1) or cb (00-1). Both aab and cb are represented by the same binary digits, 001.

If we use this code for a series of messages, it fails the first principle of codes: Different messages cannot be represented in exactly the same way. We never ran into this problem before, because our previous codewords all had the same length. If that length were 3 bits, then we'd take the first 3 bits for the first codeword, the next 3 for the next codeword, and so on. We knew how to divide up a series of bits into individual codewords, but now that the codewords might have different lengths, we don't know how to divide them.

We can put extra spaces between the codewords, but then, we are no longer using a binary code. We're using three symbols: 0, 1, and a space. In an electric signal, how do we distinguish between off (0) and a space?
Here's a new prefix-free code: a is 0, b is 10, c is 11. Consider the same stream of bits as before: 0011100010.

The first 0 must be an a, as is the second 0. Then comes 1, which might be the start of either b or c; the next 1 tells us that it's a c. The bit after that is a 1, but this time, it's followed by a 0, so that's a b.

As we said, no codeword can be the first part of any other codeword. That means that if we have a short codeword, there are many longer codewords that we simply cannot use: all the ones that begin with that short prefix.
What does this tell us about the possible lengths of codewords? This
question leads us to what mathematicians call a binary tree.
We start with a point, or node, and from it, we draw two lines off to the
right to two new nodes. Then, we do the same for those nodes. We can
continue this as far as we like, as shown.
Each node at level L gets an amount of water 1/2^L, or 2^(-L). That makes sense, because the one unit of water is spread out equally among all the nodes at that level.

Here's another way to label the nodes: The first node gets a blank label; the next two nodes are labeled 0 and 1; the next four nodes (level 2) are labeled 00, 01, 10, and 11; and so on. Each node at level L has its own L-bit label.
[Diagram: a binary tree three levels deep. The level-1 nodes are labeled 0 and 1, the level-2 nodes are 00, 01, 10, and 11, and the level-3 nodes are 000 through 111.]
A node's label gives directions for reaching that node. We don't need any directions at the starting point. But at each node, we have to decide whether to continue up or down. Thus, we let 0 stand for up and 1 for down. Node 011 is reached by going up, then down, then down again. And if we go on from that node, all the subsequent labels will start with 011.

Σ_x 2^(-L(x)) ≤ 1.
This is known as the Kraft inequality. It is a requirement on the lengths
of the various codewords, telling us that we cannot have too many
codewords that are too short.
Let's apply the Kraft inequality to the two codes we devised for the three-message source, both the bad code and the good one.

In the first code, a is 0, b is 1, and c is 00. The codeword lengths are 1, 1, and 2, and the Kraft sum is 5/4. But that is not less than or equal to 1. That sum tells us that the code cannot be prefix-free.
If the Kraft sum is less than 1, then we can always shorten some of
the codewords to make it equal to 1.
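A short Python check of the Kraft sums for the two codes just discussed (a sketch; the dictionaries hold the example codes from the text):

```python
def kraft_sum(code):
    """Sum of 2**(-length) over all codewords in the code."""
    return sum(2 ** -len(word) for word in code.values())

bad_code  = {'a': '0', 'b': '1', 'c': '00'}   # sum 1.25 > 1: cannot be prefix-free
good_code = {'a': '0', 'b': '10', 'c': '11'}  # sum exactly 1

print(kraft_sum(bad_code))   # 1.25
print(kraft_sum(good_code))  # 1.0
```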
Σ_x 2^(-L(x)) = 1.

This reminds us of a rule of probability: Any probability distribution must add up to 1. We can pretend that those numbers 2^(-L(x)) are actually probabilities q(x). They aren't the actual message probabilities, p(x), but some other, fictitious probabilities. We're just making them up out of the codeword lengths.
Notice, however: The information inequality tells us that the average
q-surprise is at least as great as the actual average surprise, the entropy.
That means that the average codeword length can never be shorter
than the entropy of the information source.
x p( x) s( x) Codeword L(x)
a 0.5 1.0 0 1
b 0.25 2.0 10 2
c 0.25 2.0 11 2
Notice that for each message, the codeword length equals the surprise. Thus, the average length equals the entropy; in this example, 1.5 bits.
x p( x) s( x)
a 0.3 1.74
b 0.3 1.74
c 0.15 2.74
d 0.15 2.74
e 0.1 3.32
In the end, our codewords are 00, 01, 100, 101, and 1100. The average codeword length for this code is 2.5 bits, which is only a little more than the entropy of 2.20 bits. The best prefix-free code we can find in any given case is called the Huffman code.
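For comparison, here is a minimal Huffman-code sketch in Python, using the standard heapq module (the tie-breaking details are my own, so the exact codewords may differ from those above, but the average length can only be as short or shorter):

```python
import heapq

def huffman_code(probs):
    """Build a prefix-free code with the smallest possible average length."""
    # Each heap entry: (probability, tie-breaker, {message: partial codeword})
    heap = [(p, i, {msg: ''}) for i, (msg, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)     # the two least likely groups...
        p2, _, code2 = heapq.heappop(heap)
        merged = {m: '0' + w for m, w in code1.items()}
        merged.update({m: '1' + w for m, w in code2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))  # ...are merged into one
        count += 1
    return heap[0][2]

probs = {'a': 0.3, 'b': 0.3, 'c': 0.15, 'd': 0.15, 'e': 0.1}
code = huffman_code(probs)
avg = sum(probs[m] * len(w) for m, w in code.items())
print(code, avg)   # average length about 2.25 bits, still above the 2.20-bit entropy
```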
No code can be more efficient than this, but we can find codes that do the job with about H(X) bits on average. Those highly efficient codes make use of regularities in the messages from the source, using shorter codewords to represent more likely messages, just as in Morse code.
By considering letter frequencies, we found the entropy to be about
4 bits per letter. By analyzing word frequencies, we estimated the
entropy to be about 10 bits per word, or 2 bits per letter.
The problem of compressing text data is actually not very difficult. Other
types of data, such as images, audio, and video, are much harder to
compress effectively.
The problem is so difficult that Shannons theory by itself cannot
solve it. We need another ingredient.
TERMS
Huffman code: A prefix-free binary code that is the most efficient code for a given set of message probabilities. Devised by MIT graduate student David Huffman in 1952.
READING
QUESTIONS
1 If you often send text messages or participate in online chats, you will find yourself using short abbreviations for common phrases. More likely messages use shorter codewords! What are some of the most common abbreviations you use?

2 We can turn Morse Code into a true binary code by writing a dot as 10, a dash as 110, and the space between letters as 0. (Roughly, 1 means on and 0 means off for the electrical signal.) Is this code prefix-free? Calculate the Kraft sum. Find the binary codewords for the most common English letters.
Lecture 6: Encoding Images and Sounds

Claude Shannon's first fundamental theorem, the principle of data compression, tells us how far we can squeeze information in a binary code: The number of bits per message can be made as low as the entropy of the source, but no lower. Yet sometimes we need to go lower, and for some kinds of data, we can. In this lecture, we'll explore the idea of compression applied to image data, sound data, and video data.
For this picture, we need just 900 numbers to do the job. Each of the numbers can be written using 8 bits, because 200 is 8 bits long in binary. That means we can represent the image with 8 × 900, or 7200 bits, less than one-fifth of the number needed if we coded the image pixel by pixel.

Run-length encoding is just one of the techniques that have been used to represent image data more efficiently. It compresses simple line drawings and cartoons to a fraction of their original size, but it doesn't work well with photographs or other images with fine textures and subtle variations in color. The methods used to compress those images are quite different, not only in detail but also in their basic philosophy.
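Here is a minimal run-length encoder in Python (a sketch of the general idea, not the specific scheme used in the lecture):

```python
from itertools import groupby

def run_length_encode(pixels):
    """Replace each run of identical values with a (value, run length) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

row = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]   # one row of a black-and-white image
print(run_length_encode(row))          # [(0, 5), (1, 2), (0, 3)]
```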
Representing Sound
Our hearing can only detect sound waves with frequencies less
than 20,000 Hertz, or 20 kHz. As we will see in Lecture 9, this
means that we only need to record the wave values 40,000 times
per second (40 kHz), just twice the frequency. Most audio for music
is sampled at 44.1 kHz.
[Figure: a complex sound wave shown as the sum of two simpler waves plotted against time.]
This is the idea behind perceptual coding, which was proposed in the
early 1970s by the German acoustical physicist Manfred Schroeder.
The idea is very much in keeping with Shannons original insight into
information: Information is about the distinction between messages.
Perceptual coding tries to preserve the distinction between subjective
perceptions but not necessarily other distinctions.
One common example of the use of perceptual coding is the MP3 sound format, which is a complex code for sound that is often used in portable music players. MP3 uses perceptual coding to achieve an amazing degree of data compression.

[Figure: input and output spectra, amplitude in dB versus frequency in Hz. A loud tone raises the masking threshold around it, so nearby quiet tones become inaudible and can be dropped from the output.]
If we take the Fourier transform of an image, set all the high frequencies to 0, and rebuild the original, the image becomes blurry. The overall arrangement is the same, but all the fine detail is lost. The fine detail affects our perception of the image, but it does not affect our perception as much as the overall arrangement. Perceptual coding is all about representing only the differences that we perceive.

This is the essential idea behind the widely used JPEG format for digital images, and once again, the degree of data compression that can be achieved is astounding. A JPEG version of a photograph may be 10 times smaller than the uncompressed image (1/10 the number of bits) but essentially indistinguishable to the human eye.
At that bit rate, even the most advanced optical discs could store only a few minutes of video data. To address this challenge, a series of powerful and flexible image-coding formats (MPEG-1, MPEG-2, and so on) was designed.

Using all these techniques and a few more, we can compress video data by a factor of 20 or more, with almost no noticeable degradation or loss of quality. The key word here is noticeable. Perceptual coding is a significant advance in data compression, but it works by changing the problem. Rather than faithfully representing the original message (a sound, an image, or a video), we identify which parts of the original information actually affect our subjective experience. We then put our efforts into conveying those parts and discard the rest.
TERMS
READINGS
The Internet is the place for the best sources about MP3, JPEG, MPEG, and
other media data formats. Start with the online encyclopedia Wikipedia
(which is usually quite good on computer science subjects). Programmers
may wish to follow links to official standards documents that give details of
the various compression algorithms.
QUESTIONS
Lecture 7: Noise and Channel Capacity

Noise is ubiquitous, and it produces ambiguity. With noise, the message we receive may not tell us everything about the message that was transmitted. Some information may be lost, introducing the possibility of error. How do we communicate in the presence of noise and still avoid errors? This is one of the most important questions in the science of information, and it's the question that Claude Shannon explored in his path-breaking paper in 1948, in which he established the foundations of information theory. In this lecture, we'll look at Shannon's remarkable and surprising answers.
Communication Channels
If the output were always exactly the same as the input, then the idea
of a channel would be trivial, but that is not usually the situation. The
channel establishes the relationship between input and output. For each
possible input, the channel determines which outputs are possible and
with what likelihood they may occur. Thus, the way to analyze a channel
is to use probability.
We can turn that around and say that the joint probability p(x,y ) is
p( x) p(y|x).
The combination of the input and output has a joint entropy, H( X,Y ),
which we can calculate from that joint distribution p(x,y ).
For any two variables with any joint probability distribution, the joint
entropy is no larger than the sum of the individual entropies. This makes
intuitive sense. The information in both variables cannot be any larger
than the information in one added to that in the other. And it might be less.
If the input is a 1-bit message and the channel is perfect, the output
is always exactly the same as the input: H( X ) = 1 bit, and H(Y ) = 1 bit.
Can we ever have H( X,Y ) = H( X ) + H(Y )? Yes, we can, provided the two
variables are totally independent of each other. In that case, p(x,y ) = p(x)
p( y ) for every pair of (x,y ) values.
But that still leaves something left over: the information that the receiver lacks about X even though he knows all about Y. Let's call that the conditional entropy, H(X|Y).

That's how much information the receiver does not learn about the transmitted message. It's a measure of failure to communicate. Information theorists sometimes call the conditional entropy the equivocation of the message. It tells us how much the receiver isn't sure about X, even when he or she knows Y.
[Diagram: the binary symmetric channel from input X to output Y. The transition probabilities are p(0|0) = 1 - e, p(1|0) = e, p(0|1) = e, and p(1|1) = 1 - e.]
This represents what happens with two people, Alice and Bob, talking at a noisy party. X is what Alice says: yes or no, 1 or 0. Y is what Bob thinks he hears. That might be the same as what Alice says, but there is some probability, e, that he hears wrong. The error probability, e, is determined by the level of noise at the party.
The joint distribution p(x,y), for equally likely inputs:

         y = 0          y = 1
x = 0    (1/2)(1 - e)   (1/2)e
x = 1    (1/2)e         (1/2)(1 - e)
The binary entropy function is easy to explain: The quantity h(p) is just the entropy of a binary source with message probabilities p and 1 - p. Below is a graph of this function as p varies from 0 to 1.

[Graph: the binary entropy function h(p) for p from 0 to 1, rising from 0 to a maximum of 1 bit at p = 1/2 and falling back to 0.]

Finally, note that the curve of the binary entropy function is very steep at either end. Even if p = 0.9 (90% of the way over to 1), the value of h(p) is still surprisingly high: 0.469 bits, almost 1/2.

Now that we know that the joint entropy is 1 + h(e) bits, we can find out other parts of our Venn diagram.

For instance, we know that H(X|Y), the amount of information Bob still lacks about X after he receives Y, is just h(e) bits.
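These quantities are easy to compute; here is a brief Python sketch (function names are mine):

```python
import math

def h(p):
    """Binary entropy function, in bits."""
    if p in (0, 1):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

e = 0.1                     # a 10% chance that Bob mishears Alice
joint = 1 + h(e)            # H(X,Y) for equally likely inputs
equivocation = h(e)         # H(X|Y), what Bob still lacks about X
mutual = 1 - h(e)           # the information actually conveyed per bit
print(joint, equivocation, mutual)   # about 1.469, 0.469, 0.531 bits
```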
Extreme Examples
Now suppose that the party is so noisy that Bob cannot tell at all what Alice is saying. He is just as likely to be wrong as right about her answer: an error probability of e = 1/2. Then, h(e) has its maximum value, 1 bit. That makes the mutual information zero. No information is conveyed.
When the output of the channel is completely independent of the
input, the mutual information is zero, and the channel conveys no
information.
Communication Errors
Is it true that a 10% chance of error destroys not 10% of the information but almost half of it? This seems counterintuitive at first, but it is actually quite reasonable, once we consider what an error really entails.

The grading marks Charlie receives (x's for wrong answers and checkmarks for correct answers) are like the output of a binary information source, the error source, with a probability of 1/10 (the value e) for an x and 9/10 (1 - e) for a checkmark.
[Diagram: the binary erasure channel from input X to output Y, where B marks an erased (blank) answer. The transition probabilities are p(0|0) = 1 - e, p(B|0) = e, p(1|0) = 0, and p(1|1) = 1 - e, p(B|1) = e, p(0|1) = 0.]

Because the two inputs are equally likely, the table of the joint distribution is easy to figure out.

         y = 0          y = B     y = 1
x = 0    (1/2)(1 - e)   (1/2)e    0
x = 1    0              (1/2)e    (1/2)(1 - e)
The input entropy, H(X), and the joint entropy, H(X,Y), are the same as before, but the entropy for the three-way output, H(Y), is not the same. The equivocation, H(X|Y), turns out to be just e bits, and the mutual information is 1 - e bits per question.

Thus, Diane really is missing just 1/10 of the information on the test. She knows when she doesn't know the answer to a question, which means she actually knows a good deal more than Charlie, who thinks he knows the answer but is sometimes wrong. To correct Diane, the teacher only needs to give her 10 additional bits of information.

What sort of errors we have matters very much. In the binary symmetric channel, like Charlie's true-false exam, the errors look no different from correct bits. That means it takes more additional information to correct the errors. In the binary erasure channel, like Diane's exam, the errors are self-announcing, and it takes less additional information to correct them.
Correcting Errors
Here is what Shannon proved: As long as you don't try to exceed the capacity of your channel, you do not have to communicate more slowly to get rid of errors.

Suppose you decide on an information rate, R (a number of bits to send per use of the channel), so that R < C, the channel capacity.
This theorem tells us that we can defeat noise without paying too high
a price. There is no inescapable trade-off between information volume
and information reliability. The possibility of error does not force us to
communicate extremely slowly. We can do error correction efficiently,
and the channel capacity tells us just how efficiently.
TERMS
READING
QUESTIONS
2 Suppose our binary symmetric channel has a very low noise rate:
e = 0.001. The expected number of errors in a 1000-bit message is
about 1. How many bits of information are lost in that message?
Lecture 8: Error-Correcting Codes

Optical disks for music and movies, reliable computer operating systems, communication with spacecraft billions of miles away: none of these things would be possible without the process of error correction. It's one of the most amazing, and most underappreciated, stories in the science of information. In this lecture, we'll examine a number of error-correcting codes.

The fact that most words have only a few neighbors makes English resistant to errors. Suppose someone sends you a message consisting of a five-letter word, but an error occurs in one letter.
Because there are 125 possible changes but only 6 (on average) are real words, the probability is better than 95% that the error will be obvious. If the word is part of a larger message, the likelihood is also very high that the context will enable you to figure out the right word in spite of the error.
What is true for written text is also true for spoken language. The
potential for error in speaking and hearing is quite high, but because
English is so redundant, a few misheard phonemes don't usually present a problem. The hearer corrects them unconsciously.

Given this capability for error correction, it's wrong to conclude that human language is an inefficient code for communication.

But Bob will also decode the message correctly if just one of the bits is wrong. There are three ways that can happen. The first, second, or third bit could be the wrong one, and each one has a probability (1 - e) × (1 - e) × e (two rights and one wrong), which gives (0.95)^2 × 0.05 = 0.045. The total probability that Bob recovers the original 1-bit message is, therefore, 0.993.
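This probability is easy to verify with a couple of lines of Python (e = 0.05 is the error rate from the example above):

```python
e = 0.05                           # chance that any single bit is flipped
no_errors = (1 - e) ** 3           # all three copies arrive correctly
one_error = 3 * e * (1 - e) ** 2   # exactly one of the three copies is flipped
print(no_errors + one_error)       # about 0.993: majority vote still recovers the bit
```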
The two codewords Alice uses are 000 and 111, which are separated by a Hamming distance of 3. This number is significant.
If Alice sends the codeword 000 and an error occurs in 1 bit, Bob
receives something like 001. But with one error, the received 3-bit
word is still closer to the original 000 (Hamming distance 1) than to
the alternative 111 (Hamming distance 2).
Error-Correcting Power
We've already seen that having d = 3 enables us to correct any single error, because a single error still leaves us closer to one codeword than any other.
What about d = 4?
Imagine two codewords separated by a distance of 4.
[Diagram: two codewords separated by a Hamming distance of 4. A single error still decodes to the nearer codeword, while two errors land exactly halfway between the codewords and can only be detected.]
If d is an even number, we can detect d/2 errors and correct one fewer. That means a d = 8 code can detect four errors or correct up to three of them. If d is an odd number, we can correct up to (d - 1)/2 errors; thus, if the code has d = 13, we can correct up to six errors.

In a (7,4) code, n = 7, which means that the codewords are 7 bits long; k = 4, which means that the codewords represent 4 bits of information. That means we need 2^4 = 16 different codewords. As we will see, for this code, d = 3; thus, the code can correct any 1-bit error.
As a reminder, 0 XOR 0 and 1 XOR 1 are both 0, while 0 XOR 1 and
1 XOR 0 are both 1. The parity bits are calculated in this way: p1 = d1
XOR d2 XOR d4, p2 = d2 XOR d3 XOR d4, p3 = d3 XOR d1 XOR d4.
Suppose one of the data bits is flipped, say d1. That means that two of the circles in the diagram now contain an odd number of 1s, which tells us that the problem must be in the overlap of those two circles, namely d1.

If d4 is flipped, all three circles now have an odd number of 1s, and we know that the three-way overlap is to blame.

If errors are not too frequent (for instance, if double errors and triple errors are extremely rare), then the Hamming (7,4) code will do a good job of correcting them.
Suppose the chance of an error in 1 bit is 1 in 1000. In other words,
our channel is a binary symmetric channel with error probability
e = 1/1000.
But if we use Hamming's code to send those 4 bits using 7 bits, the overall chance of making an error is a lot less: about 2 in 1,000,000.
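Here is a compact Python sketch of a Hamming (7,4) encoder and single-error corrector, using the three parity equations given above (the ordering of bits within the codeword is my own choice):

```python
def encode(d1, d2, d3, d4):
    """Append three parity bits to four data bits (Hamming 7,4)."""
    p1 = d1 ^ d2 ^ d4
    p2 = d2 ^ d3 ^ d4
    p3 = d3 ^ d1 ^ d4
    return [d1, d2, d3, d4, p1, p2, p3]

def correct(word):
    """Fix any single flipped bit by recomputing the parity checks."""
    d1, d2, d3, d4, p1, p2, p3 = word
    s1 = p1 ^ d1 ^ d2 ^ d4        # 1 if parity check 1 now fails
    s2 = p2 ^ d2 ^ d3 ^ d4
    s3 = p3 ^ d3 ^ d1 ^ d4
    # Each possible single-bit error produces a different pattern of failed checks.
    position = {(1, 0, 1): 0, (1, 1, 0): 1, (0, 1, 1): 2, (1, 1, 1): 3,
                (1, 0, 0): 4, (0, 1, 0): 5, (0, 0, 1): 6}.get((s1, s2, s3))
    if position is not None:
        word[position] ^= 1       # flip the offending bit back
    return word[:4]               # return the corrected data bits

sent = encode(1, 0, 1, 1)
sent[2] ^= 1                      # noise flips data bit d3
print(correct(sent))              # [1, 0, 1, 1]: the error is repaired
```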
Reed-Solomon Codes
Higher error rates need more advanced codes. Among the most popular
and powerful types are Reed-Solomon codes, developed more than 50
years ago by Irving Reed and Gustave Solomon.
In a common Reed-Solomon code, the basic symbols are not bits but bytes, groups of 8 bits considered as a single symbol. The codewords are 255 bytes long, and they represent 223 bytes of data. This is a (255,223) code. For Reed-Solomon codes, d, the minimum distance between codewords, is always n - k + 1, or 33 in this case. The (255,223) Reed-Solomon code can correct up to 16 single-byte errors.
One of the things that makes the Reed-Solomon codes so useful is their
ability to handle bursts of errors. The Reed-Solomon (255,223) code can
correct 16 byte errors, and a byte error could be a mistake in 1 or all of
the 8 bits in that group. That means that the code can correct a burst
of more than 100 consecutive 1-bit errors in the codeword. Through a
process of interleaving codewords so that the bits of each codeword
are spread far apart, the Reed-Solomon (255,223) code can do even
better than that.
However, that triumph of coding and error correction is only part of the story. The radio transmitters aboard the Voyager spacecraft are very weak, about 20 watts, yet we are still in communication with them, 12 billion miles from the Sun. How can we get any information at all from such a weak signal? To answer that, we need to explore more fully how a continuous signal can be used to transmit discrete bits of information.
TERMS
redundancy: The property of language that uses many more bits (measured
by the number of letters) to represent information than strictly necessary.
This gives natural languages the property of error correction, but it also
provides the basic vulnerability of a cipher system to cryptanalysis.
READING
QUESTIONS
1 Computer scientist Donald Knuth made a study of all the word golf
chains in English. He noted that there were some words with no nearest
neighbor, which he called aloof words. He found hundreds of these.
Can you think of a five-letter aloof word?
2 An n-bit binary codeword has n nearest neighbors. If the code has minimum Hamming distance d = 3, the code can correct any single error. That means there are n + 1 received words (the codeword itself plus its n neighbors) that decode to each original codeword. Because there are no more than 2^n binary words in all, the number of codewords in a d = 3 code can be no larger than:

(# codewords) ≤ 2^n / (n + 1).
This is called the Hamming bound. How does it work out for the
Hamming (7,4) code?
3 If we are transmitting 4 bits of data, there are only four possible 1-bit
errors. This suggests that we might be able to get by with log2(4) = 2
additional error-correcting bits. But the (7,4) code needs 3 additional
bits. Why?
Lecture 9: Signals and Bandwidth

Decades ago, the Voyager 1 and 2 spacecraft flew past the outer planets of our Solar System, sending back astonishing images and a vast trove of scientific data. Today, their instruments continue to make measurements of magnetic fields and the space environment. Voyager 1 is now more than 12 billion miles away from Earth, and Voyager 2 is almost as far. Their radio transmitters are not immensely powerful, about as strong as a cell phone tower on Earth, yet we continue to receive data from these spacecraft. How that is possible, and how the same principles affect communications here on Earth, is one of the most remarkable chapters in the story of information science.
[Figure: a radio signal with a carrier frequency of 1000 kHz, shown in both the time domain and the frequency domain.]
We all use the same radio spectrum, the same physical range of
frequencies, and every radio signal has a bandwidthit occupies a
chunk of that real estate. Because we have to share bandwidth, the
government makes regulations assigning different frequency ranges to
different kinds of transmissions or even to particular broadcast stations.
We can also separate signals in space. If two radio stations are far
apart and their power is not unlimited, then their transmissions will not
interfere with each other.
The best example of separating signals in space is electrical signals
in wires. A single wire can carry several voltage signals separated
by frequency, just like radio transmissions. In addition, if the wires
are properly shielded, the signals in one wire do not interfere with
those in other wires. The same is true for light signals in different fiber optic cables.

To understand this, let's think about a telegraph signal. Just two voltages are used: on or off, which might be 1 volt and 0 volts. If we plot the voltage over time, a message looks like a train of rectangular on/off pulses.
Instead of just turning voltage on and off with a telegraph key, let's imagine sending a smooth sine-wave voltage at some frequency. When the frequency is low, the output signal is about the same as the input.
However, the more limited the bandwidth, the more the high-
frequency Fourier components are quashed. The voltage changes
are even less sudden.
How fast is too fast? The answer to this question was discovered by an
engineer named Harry Nyquist, along with Claude Shannon.
Nyquist posed the following problem: To send a telegraph signal, we want to transmit a series of voltage values, the on/off values of the dots and dashes. Suppose the telegraph line has a limited bandwidth, W. How many different independent voltage values can we send along it per second? In other words, how many different numbers per second can be represented by a signal of bandwidth W?

If the dots and dashes of Morse code are sent through the line more quickly than the Nyquist rate, the transmitted wave won't contain enough different information. The signal will be undersampled.
Closing the circle in the sampling theorem, which Shannon did in 1949,
is regarded as one of his greatest achievements.
If we measure analog information by counting numbers (the number of independent values in a signal wave), then we face a paradox.
If we want to prevent errors, we need to use signals that are far enough
apart so that even with noise, their overlap is negligible. But spreading
signals out over a wide range of voltages has a price: High-voltage
signals cost energy. To use greater voltages, we will need more power
for the signals. As a general fact, that power is proportional to the square
of the voltage; a signal with 5 times the amplitude requires 25 times as
much power.
P stands for the average power of a voltage signal. N stands for the power of the noise, a measure of the size of the noise splotch. Their ratio, P/N, is the well-known signal-to-noise ratio, sometimes abbreviated SNR.

C = (1/2) log2(1 + P/N).

That's the number of bits that each independent signal value can convey.

As long as the noise isn't zero and we don't use an infinite amount of power, the capacity is finite. It depends on the SNR.

Wireless networks are designed to adapt when the SNR is low. The link still works, but it takes longer to send or receive a given amount of information.

C = W log2(1 + P/N).
The Shannon-Hartley formula is the basic law that governs how much
information can be sent over a noisy transmission line or radio link.
As we make our bandwidth wider and wider, we let more noise in,
and that limits the information capacity at a given power. Even a
bandwidth of 10,000 Hz gives us less than 288 bits per second.
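The Shannon-Hartley formula is simple to evaluate; here is a small Python sketch (the bandwidth and SNR values are illustrative, not the Voyager link's actual numbers):

```python
import math

def capacity(bandwidth_hz, snr):
    """Shannon-Hartley limit: C = W * log2(1 + P/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr)

# A comfortable link: 3 kHz of bandwidth with an SNR of 1000 (30 dB).
print(capacity(3000, 1000))     # roughly 30,000 bits per second

# A very weak link: an SNR of only 0.02 (assumed here) over 10,000 Hz.
print(capacity(10_000, 0.02))   # about 286 bits per second
```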
Let's go back to where we began, with the radio data link from the Voyager spacecraft, 20 billion kilometers from Earth. The Voyager radio transmitter has a power output of only around 20 watts.

The power that matters, of course, is the actual power that reaches Earth. If Voyager broadcast those 20 watts equally in all directions, the radio signal on Earth would be incredibly weak. Each square meter would receive only an unimaginably tiny fraction of a watt.
To have any hope of working, the system must somehow reduce the
noise power (N) to extremely low levels. There are both internal sources
of noise (produced by the radio receiver itself) and external sources
(including terrestrial radio transmissions and noise from deep space).
The antennas of the Deep Space Network are extremely high-gain
antennas; thus, they accept only radio energy from a narrow range of
directions in the sky. That reduces the received noise.
Even so, the Deep Space Network is just barely able to receive the
Voyager data. And as the two spacecraft get farther away, those radio
signals diminish. The SNR decreases, which, in turn, reduces the possible information rate.
When the spacecraft visited Jupiter, less than 1 billion kilometers
from Earth, data was returned at 115,000 bits per second. At Saturn,
the rate was less than half that.
Every spacecraft to the far reaches of the Solar System faces the same challenge. The New Horizons probe, which flew by Pluto in 2015, transmits its data at only 1200 bits per second.

Now, Voyager 1 is more than five times farther away than Pluto. Its data can be transmitted at only 160 bits per second, almost 800 times more slowly.
TERMS
signal-to-noise ratio: Also known as SNR, the ratio of the power of a signal to the power of the interfering noise. This is often described logarithmically, using decibels. Along with the signal bandwidth, the signal-to-noise ratio limits the information rate of an analog signal, according to the Shannon-Hartley formula.
READINGS
QUESTIONS
Lecture 10: Cryptography and Key Entropy

Thus far, we have focused on communication: how we can use codes to communicate information efficiently and to defend information from errors in transmission. These are the questions addressed by Shannon's first and second fundamental theorems. But codes can have another purpose, as well: to conceal information, to protect the privacy of messages we transmit. That is the concern of one of the main branches of the science of information, the field of cryptography.

Defining Cryptography

The principles on which physical keys and mechanical locks work are essentially information principles, just like computer passwords.
With a password chosen from a gigantic set of possibilities, the entropy is high, and we can be confident that the adversary will not be able to guess the password.
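The rule of thumb in the caption (entropy = the base-2 logarithm of the number of possible passwords) is a one-liner in Python; the character-set sizes below are illustrative assumptions:

```python
import math

def password_entropy(alphabet_size, length):
    """Base-2 logarithm of the number of possible passwords."""
    return length * math.log2(alphabet_size)

print(password_entropy(26, 8))   # 8 lowercase letters: about 37.6 bits
print(password_entropy(94, 12))  # 12 printable ASCII characters: about 78.7 bits
```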
Cipher Security
One of the earliest ciphers we know about was the Caesar cipher,
so called because it was used by Julius Caesar for his private
correspondence. In the Caesar cipher, the letters of the alphabet are
written in a circle. Then, each letter of the plaintext message is replaced
with the letter that is three places counterclockwise on the circle.
The information that Eve does not have, the choice of code within the system, is called the key to the code. A cryptographic system is secure if the entropy of the key is high enough so that Eve would have a hard time guessing it.

If our cryptographic system is a Caesar shift cipher, the key is the amount by which we shift each letter. We can simply note how the plaintext letter A is to be encrypted, and all of the other letters follow from that. For example, Caesar's original cipher has the key X, because A is shifted to X. In this type of system, there are only 26 possible keys, and the entropy of the key is low: log2(26) = 4.7 bits.
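A Caesar shift cipher takes only a few lines of Python (a sketch; here the key is given as the number of places to shift):

```python
def caesar_encrypt(plaintext, shift):
    """Shift each letter of an uppercase message by a fixed amount."""
    return ''.join(
        chr((ord(c) - ord('A') + shift) % 26 + ord('A')) if c.isalpha() else c
        for c in plaintext.upper()
    )

# Key "X" in the lecture's notation: A is shifted to X, a shift of 23 places.
print(caesar_encrypt('ATTACK AT DAWN', 23))  # XQQXZH XQ AXTK
```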
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
M B E A O R L N S I Z C K Q W X U V T J F H G D P Y
The bottom row is a reordering (or permutation) of the top row. How many permutations of 26 letters are there? As we fill in the bottom row, we have 26 possible choices for the first letter, 25 choices for the second, 24 choices for the third, and so on. The total number is 26 × 25 × 24 × ... × 1, or 26 factorial: about 400 trillion trillion possible keys.
Cryptanalysis
We know that the plaintext is an English phrase or sentence and that the code used is one of the 400 trillion trillion possible substitution ciphers. Our first step to break the code is to examine the frequencies of letters in the ciphertext. The most common letter is W, which occurs seven times. This probably means that W is the letter E, which is the most common letter in English, giving us a partial decoding of the message.

The next two most common ciphertext letters are V and L, which each occur four times. And the next most common letters in English are T, A, and O. Looking at the message, we can hypothesize that ciphertext L stands for T. That conclusion comes from the three-letter group LCW, which could be T-H-E, the most common English word. We'll tentatively add the T to our partial decoding.
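The first step of this attack, counting ciphertext letter frequencies, is easy to automate in Python (a sketch; the ciphertext string below is a stand-in, not the one from the lecture):

```python
from collections import Counter

def letter_frequencies(ciphertext):
    """Count how often each letter appears, most common first."""
    return Counter(c for c in ciphertext.upper() if c.isalpha()).most_common()

ciphertext = "WBJKR TOWQL VWLCW UVWXZ"   # stand-in ciphertext
print(letter_frequencies(ciphertext))
# [('W', 5), ('L', 2), ('V', 2), ...]: the most common letter is a good guess for E
```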
A Caesar cipher has only a single letter as its key, but the key to a Vigenère cipher is a whole word or phrase, a sequence of letters. To illustrate, let's suppose the key word is VICTORY. We write the key word over and over on one line, and underneath it, we write the letters of our plaintext message:

VICTORYVICTO
ATTACKATDAWN

Now we use the cipher disk to encipher each letter in the plaintext, using the key phrase to decide how to shift the wheel. The first letter uses the V shift, so A is lined up with V, and the plaintext A becomes ciphertext V. The second letter uses the I shift, so A is lined up with I, and the plaintext T becomes ciphertext B. Continuing, we get:

VBVTQBYOLCPB

Because the Vigenère cipher uses a phrase of many letters for its key, it has many more possible keys than a Caesar cipher. A higher key entropy makes the code more secure. Our example was the seven-letter word VICTORY, but we could choose a key as long as we like.
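Here is a short Python sketch of the Vigenère encryption just described; it reproduces the ATTACK AT DAWN example (spaces are dropped, as in the ciphertext above):

```python
def vigenere_encrypt(plaintext, key):
    """Shift each plaintext letter by the corresponding key letter (A = no shift)."""
    letters = [c for c in plaintext.upper() if c.isalpha()]
    out = []
    for i, c in enumerate(letters):
        shift = ord(key[i % len(key)].upper()) - ord('A')
        out.append(chr((ord(c) - ord('A') + shift) % 26 + ord('A')))
    return ''.join(out)

print(vigenere_encrypt('ATTACK AT DAWN', 'VICTORY'))  # VBVTQBYOLCPB
```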
TERMS
READING
Lecture 11: Cryptanalysis and Unraveling the Enigma

Encryption, the process of protecting secrets, and cryptanalysis, the process of uncovering secrets, are two sides of cryptography that are continually at war. As one side becomes more sophisticated, the other's task becomes more difficult, driving innovations on both sides. In fact, the whole history of cryptography is a kind of arms race between encryption and cryptanalysis. In the first half of the 20th century, some of the greatest triumphs of cryptanalysis were realized. Code-breaking changed the course of history more than once and opened new vistas of history to our eyes. This was also the era when Claude Shannon used information theory to show what secrecy actually meant and how and when it could be broken.
In the aftermath of World War I, the German military adopted the most
advanced cipher system available at the time: the Enigma system.
Enigma was a cipher machine, originally designed for commercial use.
With a few modications, it became the standard cryptosystem for the
German military. Enigma was the primary German code that the Allied
cryptanalysts faced during World War II.
There were many Enigma variations, but the typical setup was as
follows: The machine had a keyboard of 26 letters. There was also a
light board with 26 bulbs, one for each letter. An electrical signal from
the keyboard to the light board ran through three rotor wheels, each
of which had 26 positions and contained circuitry to scramble the
electrical connections in a different way. The rotors came in sets of five
for each machine and could be switched out or changed around. At
the front of the case, there was a switchboard-type plug arrangement
called the steckerboard. Using cables, the operator could connect up
to 10 pairs of letters.
To use the Enigma machine, the operator hit a key for the plaintext. The
bulb lit up, giving the ciphertext for that letter. The rotor then advanced,
so that the next letter would be enciphered with a different rotor
configuration.
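We can estimate the size of the Enigma key space directly from this description: pick and order 3 of the 5 rotors, set each rotor to one of its 26 positions, and connect 10 steckerboard cables. The back-of-envelope Python sketch below counts only the case of exactly 10 cables (a simplifying assumption) and lands at roughly 67 bits of key entropy, the figure used later in this lecture.

    import math

    rotor_orders = 5 * 4 * 3          # choose 3 of the 5 rotors, in order
    rotor_positions = 26 ** 3         # each of the 3 rotors has 26 positions

    # Ways to connect exactly 10 cables, pairing up 20 of the 26 letters:
    # divide 26! by the 6 unplugged letters' orderings, the 10 pairs' orderings,
    # and the 2 orderings within each pair.
    stecker = math.factorial(26) // (math.factorial(6) * math.factorial(10) * 2 ** 10)

    keys = rotor_orders * rotor_positions * stecker
    print(math.log2(keys))            # ~67 bits of key entropy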
If the crib (a guessed fragment of plaintext) was long enough, there would be only a few possible places where it could occur in the message.
The next step was one of Turings great discoveries about Enigma.
Suppose we have a crib in a possible location within the ciphertext. It
can happen that the plaintext and ciphertext form a loop. The plaintext
letter A is encoded as R. The next place over, plaintext R is encoded as
N. Two places down, plaintext N is encoded as A. A to R to N and back
to A makes a kind of closed loop. Such loops occur surprisingly often.
Turing realized that this loop structure did not depend on the plug
settings of the steckerboard but only on the rotor settings.
During World War II, Shannon was working out his ideas about codes
and communication and realized that information theory could provide a
new approach to cryptography. He tackled the subject by asking some
fundamental questions: What is a cipher? What makes it secure? What
does it mean to break a cipher? When is that possible?
The more we learn about the message, the smaller our missing
information, H. There are fewer possible messages consistent with what
we know. Eventually, we arrive at the point where we know everything,
and there is only one possible message: H = 0 and M = 1. Then, we know
what the message is, or to be more precise, we know everything that is
required for determining what the message is. We might still need to do
a great deal of calculation to turn the message into an understandable
form, but we do not need to receive any more bits.
Suppose Alice is sending a secret message to Bob, while Eve listens in.
We'll call the plaintext P and the key K. The ciphertext, which Eve might
intercept, is C.
Now imagine that Alice and Bob use a substitution cipher. Eve
intercepts a short ciphertext message: WBJKRTO.
What information does Eve lack? From the beginning, Eve does not
know the key, K. Thus, she starts out with missing information H(K),
the key entropy. Once the message is sent, she also lacks H(P), the
information content of the plaintext. Each plaintext letter adds hP bits of
entropy; thus, a message of length L has H(P) = L · hP. The total amount of information Eve lacks is H(K) + L · hP.
We can graph both the amount of information Eve has and the amount
she lacks, as shown below:
[Figure: two graphs of bits versus message length L. Left: what Eve knows, a line from the origin with slope hC, reaching L · hC. Right: what Eve lacks, a line starting at H(K) with slope hP, reaching H(K) + L · hP.]
The graph on the left shows the amount of information Eve has.
It starts at zero and gets larger as the message gets longer. The
slope of the line is hC = 4.7 bits per letter, the entropy of one letter
of ciphertext.
The graph on the right shows the amount of information Eve lacks.
It starts at H(K), the key entropy, and gets larger as the message
gets longer. The slope of the line is hP, the entropy of one letter
of plaintext.
The value of hP is not 4.7 bits per letter. As we have seen, English is
redundant. Our experiments led us to estimate that the entropy of English
text is less than 2 bits per letter. Shannons estimate was about 1.5 bits
per letter. Thus, our 26-letter alphabet, with 4.7 bits per letter, introduces
a great deal of redundancy. For each letter, 4.7 bits − 1.5 bits = 3.2 bits of redundant information, on average. The per-letter redundancy of English is designated D, and D = hC − hP, or 3.2 bits per letter.
The difference between the two graphs is the net amount of information that Eve lacks: H(K) + H(P) − H(C). That quantity shrinks to zero eventually. When that happens, there is only one possible key and one possible plaintext consistent with the ciphertext that Eve has. That means she can break the code.
When does this happen? It happens near the point of intersection of the two straight-line graphs. Where does this happen? Let's solve for L. At the intersection point, L · hC = H(K) + L · hP, so H(K) = L · (hC − hP), which is L · D. The message length at which this occurs, L = H(K)/D, is called the unicity distance.
If Alice and Bob use a substitution cipher, there are 26! = 400 trillion
trillion possible keys. That's a key entropy of H(K) = 88 bits. Using
Shannons estimated English redundancy, D = 3.2 bits per letter, the
unicity distance is 88 bits/3.2 bits per letter = 28 letters. A ciphertext
much shorter than this is not enough to break the code; many possible
plaintexts will be consistent with the ciphertext. But if the ciphertext
is much longer than 28 letters, there is probably only one sensible
plaintext consistent with it, and it ought to be possible to break
the code.
For Enigma, we must first determine the key entropy, H(K), which in this case is about 67 bits. That's actually less than the key entropy of a substitution cipher,
and the unicity distance is correspondingly less: L = (67 bits)/(3.2 bits per
letter) = 21 letters.
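A quick Python check of the unicity-distance arithmetic, using Shannon's redundancy estimate of 3.2 bits per letter:

    import math

    D = 3.2                                          # redundancy of English, bits per letter

    H_substitution = math.log2(math.factorial(26))   # ~88 bits of key entropy
    H_enigma = 67                                    # from the Enigma key-space estimate

    print(H_substitution / D)   # ~27.6 -> about 28 letters
    print(H_enigma / D)         # ~20.9 -> about 21 letters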
Enigma's entropy mountain is not quite so high, but the sides appear
to be sheer cliffs. With his discovery that the rotor and steckerboard
settings of Enigma could be solved separately, Turing managed to
scale a much lower cliff, then follow an easy step-by-step ascent the
rest of the way.
Not all cryptanalysis has to do with espionage and war. In fact, one of
the great achievements of cryptanalysis just after World War II was the
decipherment of Linear B by Michael Ventris and John Chadwick.
Ventris began to search the inscriptions for place names, which are
often very old. The place names that Ventris conjectured gave him a few
syllables to go on, for instance, ko-no-so for Knossos, the place where
many of the tablets were found. These place names gave Ventris a kind
of crib to Linear B.
READINGS
QUESTIONS
1 The Enigma system and ROT 13 are vastly different ciphers, yet they
share one essential property. What is it?
In 1943, during World War II, Claude Shannon was called on to evaluate the security of the encrypted telephone link that allowed Franklin Roosevelt and Winston Churchill to confer in secret across the Atlantic Ocean. Codenamed SIGSALY, this system was the first ever designed to protect the privacy of that kind of electronic communication. In this lecture, we'll look at how this amazing technological achievement worked. We'll then turn to some of the problems addressed in post-Cold War cryptography.
SIGSALY
SIGSALY was the first telephone system to convert analog voice signals
into digital information, using a technique called pulse code modulation.
The digital signal amounted to a highly compressed version of the
original voice data. As a result, the compressed voice data had much less
redundancy than the original. As we've seen, cryptanalysis relies on that redundancy, and reducing it makes the eavesdropper's job more difficult.
[Photo: SIGSALY apparatus]
The real secrecy of SIGSALY came from the cryptographic key: a pair of
identical phonograph records, one at the transmitting station and one
at the receiving station. Each record contained the exact same random
electronic noise. Before the SIGSALY voice data was transmitted, it was
added to the random noise from the record. At the other end, the same
noise was subtracted from the received signal, yielding the original.
That, in turn, was decompressed and turned into a recognizable,
understandable telephone voice.
Notice the assumption in this analysis. The secret key has a fixed entropy, H(K). Therefore, the secrecy gap starts out at a fixed size and gets narrower over time. That's how the enemy catches up.
In the binary version of the one-time pad, Alice combines each plaintext bit with the corresponding bit of a random key string, using the XOR operation:
Plaintext:  01101000011001010110110001110000
Key:        11100011000010001110100011110100
Ciphertext: 10001011011011011000010010000100
To recover the plaintext at his end, Bob simply applies the XOR operation
again, using the key string that he also has. That gives him ciphertext
XOR key, which is plaintext XOR key XOR key. Because anything XOR
itself is zero, this just yields the plaintext again.
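A minimal Python sketch of the binary one-time pad, using the 32-bit plaintext and key shown above:

    plaintext  = "01101000011001010110110001110000"
    key        = "11100011000010001110100011110100"

    def xor_bits(a, b):
        """XOR two equal-length bit strings, bit by bit."""
        return ''.join('1' if x != y else '0' for x, y in zip(a, b))

    ciphertext = xor_bits(plaintext, key)
    print(ciphertext)                               # 10001011011011011000010010000100
    print(xor_bits(ciphertext, key) == plaintext)   # True: XOR with the key again recovers the plaintext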
If Alice and Bob tried to reuse the key, this binary version of the one-
time pad would no longer be perfectly secure. This fact was the basis for
one of the most remarkable episodes in the history of cryptanalysis: the
Venona project.
During the 1930s and 1940s, the Soviet secret intelligence service,
the NKVD, communicated with its stations in Soviet embassies
around the world using a one-time pad system for ordinary
commercial cable messages. However, for reasons that are not
entirely clear, identical code key sheets were sometimes used in
different pads.
In other words, the problem with the one-time pad is distributing new key material. This function cannot be done using the cryptographic
system itself, because that uses up as much key data as it
would transmit. Some other means, such as trusted couriers,
must be used. But as a practical matter, that approach is a point
of vulnerability.
The one-time pad is reminiscent of the remote keys we now use to open
car doors and garage doors. When you press a button on the remote
unit, it sends a coded radio signal, a string of a few dozen bits, as the
key. The receiver recognizes the key signal and unlocks the door.
The problem with this system is that someone nearby could intercept
the transmission, record the key signal produced by the remote unit,
and re-transmit the same key later to unlock the door. Initially, remote
key units used the same key code repeatedly, but modern systems use
something called a rolling code. Each time you use the unit, it emits a
new key code; no key code can open the door more than once.
The new keys are not really new information, and if someone recorded a
few of the successive key signals, it would be possible, at least in theory,
to deduce the key-generating code and predict the next key. But that
would be a difficult mathematical problem to solve, and the difficulty is
what makes the system secure.
Computational Complexity
The fact is: No one has ever proved that a true trapdoor function exists. The question is tied to what may be the most famous unsolved problem in mathematics: whether the complexity class NP is the same as P. Although no one has ever quite proven that NP is not the same as P, almost everyone is prepared to assume that they are different, that trapdoor functions do exist, and that the factoring problem is one of them. This mathematical hypothesis leads us to an astounding new form of cryptography.
Public-Key Cryptography
In principle, anyone who has the encryption key can work out the
decryption key; you can always factor a number. However, if the
number is very large, it might be impossible in practice to do so.
Eve knows the public key and the ciphertext that Alice sends to Bob.
In fact, she has all the information she needs! All she has to do is
factor the encryption key to get the decryption key, then decode the
message. But if the key is long enough, no computer available to Eve
could do the factoring; for this reason, the message remains secret.
For this system to work, Bob must generate two very large prime
numbers, dozens of digits long, to create the key. At first glance, this step looks tricky: you might think that checking whether a huge number is prime requires factoring it. But it's possible to identify primes with a high degree of certainty by applying mathematical tests that never actually factor the number.
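The lecture does not name the tests, but the Miller-Rabin test sketched below (in Python) is one standard choice: after a few rounds of random trials it declares a number prime with overwhelming confidence, and it never needs to factor anything.

    import random

    def probably_prime(n, rounds=20):
        """Miller-Rabin test: returns True if n is almost certainly prime."""
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        # Write n - 1 as d * 2**r with d odd.
        d, r = n - 1, 0
        while d % 2 == 0:
            d //= 2
            r += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False       # a is a witness that n is composite
        return True                # no witness found: n is almost certainly prime

    print(probably_prime(2**127 - 1))   # True: a known (Mersenne) prime
    print(probably_prime(2**127 + 1))   # False: it has factors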
READING
QUESTIONS
1 How might a cryptanalyst recognize that a one-time pad key has been
used more than once? You may assume that the cryptanalyst is very
patient or has the services of a computer.
So far, the science of information has been all about human
communication and human technology. But nature has her own
messages, her own vast networks of information exchange, her
own natural codes for representing information and technologies
for processing it. Thus, the concept of information is indispensable
for understanding the natural world. And nowhere is this truer than in
biology, the science of life.
Mendel published his work in 1865, but it was quickly forgotten. When
it was rediscovered a few decades later, the science of genetics was
born. The location of the genetic information within a cell was soon
identified: the chromosomes, tiny structures within cells, barely visible
in a microscope.
Among those who speculated on the nature of the gene was the
Austrian physicist Erwin Schrödinger. He noted that genetic information is discrete rather than continuous; that is, it is digital information, not analog. The first piece of evidence for this idea was Mendel's discrete binary genes in pea plants. The second was a clever argument about
the stability of genetic information.
Crick and Watson found that two backbones of phosphate and sugar
form the sides of the DNA molecule. In three dimensions, the structure is
twisted in a double-helix form, but we can think of the backbones as the
two sides of a simple ladder.
Crick and Watson were quick to note that the base-pair system makes copying straightforward. The original DNA molecule first splits apart along the join between the bases in each rung. New units assemble along the split halves, but only certain bases can fit together: A with T or C with G. Thus, the original sequence of base pairs (the genetic information) is reproduced in the two resulting DNA molecules.
A living cell contains only a tiny amount of DNA, but that still may
represent billions of bits of information. The DNA information in a cell
is a blueprint of its structure and function. What kind of code is used for
that? One of the first people to glimpse the answer to this question was
the physicist George Gamow.
Gamow believed that the sequence of base pairs in the DNA molecule
must somehow encode the sequence of amino acids in the protein.
The words of the protein language are the amino acids. Given
that there are 20 possible words, each amino acid represents
log2(20) = 4.3 bits of information. But the DNA language has only four letters (the base pairs), for 2 bits per letter.
Thus, Gamow conjectured that in the genetic code, each amino acid
was represented by a codeword, called a codon, consisting of three
DNA base pairs. Later experiments by Francis Crick and others
showed that he was exactly right.
The DNA alphabet has four letters. The codon words are three letters long, representing the 20 amino acids. These form long sentences: proteins made up of hundreds of amino acids. One of those sentences is one of Mendel's genes, because the proteins control the way the organism appears and functions.
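Gamow's counting argument takes one line of arithmetic to verify: two-letter codons give only 4² = 16 codewords, too few for 20 amino acids, while three-letter codons give 4³ = 64, more than enough. A minimal Python check:

    import math

    amino_acids = 20
    bases = 4                      # A, C, G, T (or U in RNA)

    print(math.log2(amino_acids))  # ~4.32 bits needed per amino acid
    print(bases ** 2, bases ** 3)  # 16 codewords (too few), 64 codewords (enough)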
Because the two ends of the RNA molecule are slightly different
chemically, the ribosome knows which end to start with. Each group of
three bases is a codon, but the first codons on the RNA strand might
not represent anything. They are in the untranslated region of the RNA
molecule. Protein building does not start until a special triplet called the
start codon is reached. There is some variation, but usually, this is the
codon AUG, which stands for the amino acid methionine.
The details of the genetic code were worked out in the early 1960s, but
since then, some dramatic technological advances have been made.
We have learned how to read and create very long DNA sequences.
We can record the complete genome (the entire collection of genetic information) of an organism.
Our own genome contains 6 billion base pairs, or about 12 billion bits.
Only about 2% of that, however, actually describes proteins via the
genetic code. The rest is called noncoding DNA.
For instance, the genes coding proteins are very far apart,
separated by long stretches of noncoding DNA. That means that
they are always translated separately, and that means that a frame error (a base lost or inserted) affects only one of the proteins, not both.
Chemically, there are many more than 20 possible amino acids. Thus,
there is a universe of conceivable proteins, some of them perhaps
useful, that cannot be made by ordinary living things on Earth. We don't know how to make them because the genetic code has no room for them; we've used up all the codewords. Can we change that? Can we find a way to alter cells so that they can use unnatural amino acids in
their proteins?
The translation of a codon in mRNA into an amino acid is done by
a short piece of RNA called transfer RNA (tRNA). A tRNA molecule
is bent in a kind of hairpin shape, with side loops. Complementary
bases attach to each other along the parallel sides.
At one end, where the hairpin bends around, three bases stick
out. These three bases can attach themselves to a complementary
set of bases on the mRNA chain: U with A, G with C, and so on.
And each tRNA molecule is bound at the other end to a particular
amino acid.
Finally, and most radically, we can imagine nucleic acids with more
different bases. Our genetic code could then employ a larger alphabet.
In addition to A, U, G, and C, we might have X and Y. The number of
possible codons is now 6 × 6 × 6 = 216 codewords, which gives plenty of
room for new amino acids in the new code. This has actually been done
for the DNA in some bacteria.
For example, the codon ACC translates as the amino acid threonine.
But suppose something happens to the RNA and the final letter is
changed to something else. The codon becomes ACA, ACG, or ACU.
Any of these still translates as threonine. The protein is unaffected.
We've used computer models to see how our own genetic code stacks
up against other conceivable genetic codes, generated at random. As
it turns out, our code tolerates errors better than the vast majority of
the others. Our basic genetic information system is well-adapted to the
noisy molecular reality.
READINGS
1 In DNA, the distance along the ladder between adjacent base pairs is 0.33 nanometers (1 nm = 10⁻⁹ m). What is the length of the entire human
genome? Which is longer, you or your DNA?
In the past, people thought that living organisms could emerge from
nonliving matter. This idea was called spontaneous generation or
abiogenesis. Frogs and worms might spring from mud; mice could arise
from discarded food and rags. But the 17th-century Italian naturalist
Francesco Redi argued the opposite. Every living thing, he said, is always
the offspring of another living thing. As he put it, omne vivum ex vivo: all life comes from life. And by the end of the 19th century, such biologists as Louis Pasteur had established the truth of Redi's principle, even for
the tiniest microorganisms. Yet that leaves us with a question: What is the
origin of life? Where did the first genetic information come from?
DNA → RNA → Proteins
The central dogma is true, but it's more complicated than it seems.
Consider, for example, viruses.
A virus injects its own genetic information into a cell. There, it is
copied and transcribed, just like the cells own genes, and the cell
ends up making more viruses.
But other viruses, called RNA viruses, dispense with DNA altogether.
Their genetic material is RNA. That RNA is replicated, along with the virus's proteins, inside the cell. This process changes our diagram:

DNA → RNA → Proteins, with an added arrow from RNA back to itself (RNA → RNA).
But even this picture leaves out something important. Proteins are the
worker bees in the biochemistry of a cell. They form a good deal of the
structure of the cell, and they act as enzymes.
An enzyme is a biological molecule that works as a catalyst, assisting a chemical reaction; enzymes break molecules apart or join molecules together. The biochemistry in a cell simply cannot take place without a number of different enzymes. And almost every enzyme is a protein.
[Diagram: DNA → RNA → Proteins, with enzymes (proteins) attending every arrow.]
Each arrow in our information diagram (copying, transcription, translation) is a biochemical process that involves taking molecules apart and putting them together. Each one is attended by a swarm of specialized enzymes (proteins) that make it all happen. Thus, in reality, proteins are present everywhere in the diagram.
Here is the larger point: The basic information system of life is complex
and interdependent. Proteins can't be made without nucleic acid blueprints (DNA and RNA), and DNA and RNA can't be made without protein enzymes. How does such an interdependent system get started in the first place?
Molecular Beginnings
Let's begin with a molecule. It's a long chain of small units, a kind of polymer. The units are of two or more different kinds. The particular sequence of units in the molecule constitutes information. This is a one-dimensional version of Schrödinger's aperiodic crystal, like
the sequence of bases in RNA or the sequence of amino acids in
a protein.
The environment of the molecule contains many of the basic units and
even some shorter segments floating around. Interestingly, chemical
interactions among the fragments tend to cause units to assemble in the
same sequence alongside the original.
What we have described is an information-bearing molecule that, in the
right environment, can reproduce itself. Offspring molecules get their
information from their parents. Molecules with different sequences can
compete for resources. We've stripped away all the complications found in present-day organisms, and we're left with a naked genome, living
in an environment in which spontaneous copying is possible. Such a
molecule might be the beginning of the whole genetic system of life.
This is just the sort of system that the German biophysicist and chemist
Manfred Eigen was thinking about in the 1970s. Shannon had pointed
out that any process by which information is transmitted may be subject
to noise and result in errors. Eigen realized that errors are important.
A self-reproducing molecule makes a copy of itself. Then, those
two molecules make copies of themselves, and so on. If the
copying process introduces errors, after a while, the vast majority
of molecules in the population will have many errors. There will be
many different versions of the genetic information, and few, if any,
of them will look much like the original.
That should be true not only billions of years ago but also today. We
said that the error rate for the human genome is something like one-billionth: e = 10⁻⁹. The size of the human genome is about 6 billion base pairs.
The process by which virus DNA or RNA is copied is much more prone
to error. The error rate is about 1 in 10,000. But viruses dont need much
genetic information. The genome of a typical virus is only about 10,000
base pairs long. The product L · e is about 1, just small enough to pass.
Eigen guessed that in the earliest days of life, the error rate might be
around 1%. That would mean that any successful self-reproducing
molecule could not be much longer than 100 units. Otherwise, it would
be overtaken by the error catastrophe.
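Eigen's threshold amounts to requiring that the expected number of errors per copy, L × e, stay near 1 or below, so the longest sustainable genome is roughly 1/e units. A small Python sketch of the two cases just described:

    cases = {
        "RNA virus":  (10_000, 1e-4),   # genome length (bases), per-base error rate
        "early life": (100,    1e-2),   # Eigen's guess for the first replicators
    }

    for name, (L, e) in cases.items():
        print(name, "expected errors per copy:", L * e)   # ~1 in both cases

    # The longest genome that stays below the error catastrophe is roughly 1/e:
    print("max length at 1% error rate:", 1 / 0.01)       # ~100 units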
Spiegelman's Monster
This number, 100 units, is interesting. Not long before Eigen's work,
the American molecular biologist Sol Spiegelman did an interesting
experiment on self-reproducing molecules. He started with a very
simple RNA virus called Q-beta. The RNA genome of this virus is only
4200 bases long, containing descriptions of just four proteins.
Spiegelman extracted the RNA from Q-beta and placed it in a test tube
containing a solution of the raw materials to make RNA, plus a supply
of the RNA replicase enzyme, the same enzyme the virus uses to reproduce itself inside a cell.
The RNA in the test tube began to make copies of itself. Spiegelman
repeatedly transferred tiny samples from one test tube to another.
The time between transfers was only a few minutes. In other words,
Spiegelman was arranging conditions to favor RNA molecules that
reproduced quickly.
This notion was christened the RNA world by another Nobel laureate,
the chemist and physicist Walter Gilbert. In the RNA world, the first life, the carrier of the earliest genetic information, might have been a
collection of relatively short sequences of RNA.
The RNA world is an attractive hypothesis, and there have been some
recent discoveries that support it.
Many scientists argue that Eigen's estimate of the early error rate, 1%, is probably too large. That would imply that longer RNA
sequences could escape the error catastrophe.
For a long time, the ribozymes Holliger found could only catalyze
the copying of RNA sequences much shorter than themselves.
But very recently, the researchers have discovered a new mutant
molecule, which they call tC9Y. This molecule is 202 bases long, but
it can assist the copying of some RNA strands up to 206 bases long.
The dream of the RNA world, a world of nucleic acids that catalyze
their own reproduction, seems more plausible than ever.
Not everyone, however, thinks that it all started with RNA. Graham
Cairns-Smith, a molecular biologist at the University of Glasgow, looks
at the problem this way: The biochemical system of life is like an arch.
Every segment of an arch rests on the others. If you take any one out,
it all collapses. Given that, how could the arch have been built piece
by piece?
The answer is that there was a scaffold, on which the pieces of the
arch were assembled. Once the arch was complete, the scaffold was
removed, and we no longer see it. We just see the complete, self-
supporting arch.
Cairns-Smith believes that the earliest form of life was very different from
us. It stored information, grew, and reproduced in a completely different
way. But it served as a scaffold, on which the other piecesDNA, RNA,
and proteinswere added. The new biochemical machinery turned out
to be much more efficient. It developed a complex system of its own and
eventually became self-supporting. Thus, it took over, and the old form
of life crumbled away.
Cairns-Smith believes that this lost scaffold for life might have been
crystals of clay. The mineral kaolin, which is a silicate of aluminum, is
extremely common. It forms readily in water solutions. And its crystal
structure can store and copy information.
An ideal crystal is a perfect, repeating arrangement of atoms;
it contains no information. But a real crystal is usually more
complicated. In each small piece, the structure is perfect, but
the atoms are stacked differently in different places. There are
boundaries between these domains.
A few years after Sol Spiegelman created his rogue RNA monster,
Manfred Eigen repeated the experiment. He first showed that a single
strand of viral RNA was enough to get things going. Then he tried starting
with no RNA at all. He set up his test tubes with RNA components and
replicase enzyme and waited.
When RNA molecules are copied, errors are sometimes made. Genetic
information is lost, and random bits are introduced. Natural selection
can help fix the mistakes and restore the lost information. This kind
of error correction process does not happen at the individual level.
We have called the process error correction, because most of the time,
it is conservative, weeding out harmful mistakes in the population. But
occasionally, one of those mistakes by chance produces a genome that
is a better match to the local conditions, a closer representation of environmental information. Then, the error correction process actually favors the innovation. That's how evolution works; it is also the principle
behind one of the strangest computers ever built.
That computer was built in 1994 by Leonard Adleman, who encoded a small map of seven cities in strands of DNA and let the molecules themselves search for a route that visits every city.
[Diagram: Adleman's map of seven cities, numbered 1 through 7, connected by a network of routes.]
Then, Adleman performed a series of steps that ensured that the only
DNA strands that would easily make copies of themselves were those
with three characteristics: (1) They started at 1 and ended at 7; (2) they
included exactly seven cities; and (3) they included every city at least
once. The resulting sample of DNA consisted of molecules that encoded
the Hamiltonian path. The molecules had found the solution to the
problem: 1 to 4 to 5 to 2 to 3 to 6 to 7.
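Adleman's molecules were doing, in parallel, what a conventional program does one path at a time. The Python sketch below does the same brute-force search; the edge list is a hypothetical stand-in (the real map appears only in the figure) chosen so that it contains the path 1-4-5-2-3-6-7.

    from itertools import permutations

    # Hypothetical directed edges (the real map is in the figure, not reproduced here).
    edges = {(1, 4), (4, 5), (5, 2), (2, 3), (3, 6), (6, 7),
             (1, 7), (2, 6), (3, 5), (4, 2)}

    def hamiltonian_paths(cities, edges, start, end):
        """Yield every ordering of the cities that starts at `start`, ends at `end`,
        and follows an existing edge at every step."""
        middle = [c for c in cities if c not in (start, end)]
        for order in permutations(middle):
            path = (start,) + order + (end,)
            if all((a, b) in edges for a, b in zip(path, path[1:])):
                yield path

    for path in hamiltonian_paths(range(1, 8), edges, start=1, end=7):
        print(path)      # prints (1, 4, 5, 2, 3, 6, 7) for this edge list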
In the same way, the genetic information in living cells is the result of
eons of natural computation. Their DNA contains a deep record of
the whole history of life, of the chemical facts and environmental forces
that have shaped life at every stage. Some of that record is extremely
ancient, billions of years old. With the help of the science of information,
we are slowly beginning to read it.
TERM
READINGS
2 Why is an RNA world hypothesis for the origin of life more plausible
than a DNA world or a protein world?
Neural Codes in the Brain
Computers and brains are alike in many ways. They both have
inputs and outputs, and they store and process information.
However, the basic components of the brain, the neurons,
are much slower and more imprecise than the electronic
components of a computer. On the other hand, the brain has billions of
neurons, all working at once. Also, the brain consumes much less energy
than a computer. The differences between the two are significant, yet
thinking of the brain as a computer suggests some interesting questions:
How many bits of information do our senses provide us? What is the
neural code for the electrochemical signals among the neurons? How is
memory stored, and what is the memory capacity of the brain? In this
lecture, we'll try to map out some connections to learn what the principles
of information can tell us about the workings of the brain.
The bottom-up approach starts with a single neuron and asks: How
does it work? What signals does it transmit? How does it interact
with other neurons? Perhaps we monitor the activity of an individual
neuron inside a live subject, testing to see how changes in the
outside stimulus change the signals.
[Plot: the information transmitted, I(S:R), versus the stimulus entropy, H(S), in a tone-identification experiment.] The curve bends over and runs horizontally. It maxes out at around 2.5 bits, roughly log2(6). After that, using more different tones just leads to more errors.
Only six tones seems shockingly low. Surely, our ears provide us
with more information than that, but something about the way we
process the information in this simple task limits our capacity.
This experiment can also be done with other types of stimuli. For
example, the tones could differ in loudness rather than pitch. Exactly the
same kind of curve emerges, except that the plateau (the capacity) is about 2.3 bits, or around log2(5). We can reliably distinguish about five
levels of loudness.
Neurons are the basic units for information processing and communication
in the nervous system. They're found in almost every type of animal, except sponges. The general structure of a neuron is as follows: There is a main body called the soma, which has a group of small branching appendages called dendrites. There is also a long fiber called an axon that
stretches out from the neuron. Down the axon are some more branches
called synaptic terminals, which touch the dendrites of other neurons.
The soma, which puts together all the inputs from the dendrites, does a
kind of information processing as it determines whether or not a pulse
is produced. The axon is a transmission line for the output signal to be
carried to other neurons. The axon signal is not electrical, but it does
involve changes in electric potential; thus, we can detect it electrically.
It is actually a propagating change in the chemical concentrations of
sodium and potassium ions across the outer membrane.
The information speed in the axon signal is not that fast, perhaps tens of
meters per second. This means that neuron signals are a million times
slower than electrical signals in wires. If we were to place ourselves at
one point along the axon and measure the electric potential over time, it
would look like this:
[Plot: electric potential versus time, showing a series of brief spikes.]
Each of the spikes represents a pulse going past, and the pulses
last only for a couple of milliseconds.
The interesting thing is that the details of the amplitude and the
shape of the spike do not appear to matter. In fact, those details
might change a bit for a signal, from one end of the axon to the
other, but as far as the neural information is concerned, a spike is
a spike. The information is surprisingly binary: There is a spike, or
there is not a spike.
The only question is when the spikes occur, and even there, it
appears that timing details finer than about 1 millisecond are usually
unimportant.
Now we know what neural information looks like: It's a stream of binary digits, mostly 0s. They are generated at around 1000 bits per second and travel down the axon.
[Diagram: a spike train over time, read off as the binary sequence ... 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 ...]
Rate Coding
The next question is: What do the bits mean? What is the neural code?
The first clue here was found in 1926, when Edgar Adrian and Yngve
Zotterman at Cambridge did an experiment on muscle tissue taken
from a frog.
They monitored the impulses from a single sensory nerve ber
in the muscle while they pulled at the tissue with small weights.
They found that the rate of pulses was determined by the amount
of pulling force. A larger force produced more frequent spikes.
For example, a 1/8-gram weight produced about 20 pulses per
second, a 1/2-gram weight produced about 50, and a 3-gram weight
produced more than 100.
But the main point remains: In a sensory nerve, the rate of the
pulses represents the strength of the stimulus. This idea is called
rate coding, and it appears to be the basic neural code used by
many kinds of neurons, including many within the brain.
Here, for example, are two possible spike sequences from such a neuron, each containing nine pulses over the same stretch of time:
000010010000101000010001000000100000100001000
001000000100100000100010000001000100000100100
However, that number is probably too high. The pulses come along
somewhat randomly, which means that, by chance, a given interval
of time might contain more or fewer pulses. If the expected rate is 9
pulses, the interval might actually contain 6 pulses or 11.
Several factors mitigate this. For one thing, if you have several neurons
carrying the same informationproducing pulses at the same expected
ratethen by combining the signals from all of them, you can make a
faster and more accurate determination of that rate. Also, different
neurons can carry complementary information about the same thing. We
saw this in Lecture 6, when we described hearing.
[Plot: pulse rate versus sound frequency, the tuning curve of a single auditory neuron.]
But different neurons will have different tuning curves that peak at different sound frequencies:
[Plot: pulse rate versus sound frequency, showing several tuning curves that peak at different frequencies.]
[Plot: a visual cortex neuron's response versus stimulus orientation.]
There are millions of such cells in the visual cortex, and they are a
signicant part of how we integrate data from our eyes into visual
perception.
Below are the two binary pulse sequences we saw earlier. Both have
nine pulses. In rate coding, they are equivalent codewords. But because
the pulses occur at different times, they are distinct codewords in
temporal coding. Therefore, the neural information rate would be higher
if neurons use temporal coding.
000010010000101000010001000000100000100001000...
001000000100100000100010000001000100000100100...
Several pieces of data suggest that precise timing matters in the brain.
For instance, suppose you shut your eyes and listen to a sound from
different points. You can tell which direction the sound comes from,
in part because of different degrees of loudness in your two ears. But
experiments prove that most of that sense of sound localization comes
from the difference in arrival time of the sound waves at your ears, and
that difference is exceedingly small, much less than 1 millisecond. Thus, tiny
timing variations must somehow give rise to information in your brain.
Another, more indirect point is this: The nervous system transmits pulse
timing information very exactly. The pulses do not get closer together
or farther apart as they travel. This is not evidence that this timing
information is used in the neural code, but if it isn't being used, why does
the nervous system preserve it so well?
In addition, if the exact timing of the pulses was literally
meaningless, then there would be no reason for different neurons to fire at the same moment; they could produce their pulses
independently. They might occasionally produce simultaneous
pulses by chance, but that would represent nothing special in the
neural information network.
There are many other observations and experiments of the same sort,
all tending to show that precise timing makes a difference. For this
reason, the current consensus is that temporal coding does play a role
in the neural code, at least in certain places. This fact greatly increases
our estimate of how much information is being exchanged by neurons
within the brain.
Memory
But notice, that's how we form memories in the first place. Thus, in
our memory system, the information is read using the same process
by which it was written. Every time we access a memory, we are
rewriting it, tracing the pen over the letters on the page. In that
process, sometimes the memory information can be altered. This is
why it is possible, even fairly common, to have false memories.
neuron: The basic unit of the brain for information processing and
communication. Each neuron consists of a main body called the soma and
a long fiber called an axon. Connections from other neurons induce the soma to produce nerve impulses, which are transmitted down the axon and influence other neurons connected at its end.
READINGS
Entropy and Microstate Information
When the concept of entropy was originally discovered
in the context of thermodynamics, it seemed to have
nothing to do with information. It was simply a way of
determining whether or not an energy transformation was
possible. However, when we look at the story of its discovery through
our information-colored glasses, we can see entropy for what it really is, because thermodynamics is not just a practical and profound branch
of physics; it is also a place where the laws of nature and the laws of
information meet.
Systems can exchange energy in two distinct ways. The first way is called work, W. This is energy transfer that is associated with a force
acting through a distance, such as the force of gravity acting on a
baseball after you drop it. Less obviously, work is also done when you
charge up an electric battery or when a material becomes magnetized.
The second way that energy is transferred is called heat, Q. If you put
two bodies at different temperatures next to each other, there is a heat flow from the hotter one to the colder one. That energy transfer isn't work because the objects are not exerting forces or being moved.
Any sort of energy transformation or exchange is consistent with the first law. It does not matter, for instance, whether heat flows from hot to cold or cold to hot; either direction obeys the law. But in the real world, heat only flows from hot to cold, never the reverse.
In addition, we can always transform work into heat. If you rub your
hands together, they warm up. Friction turns muscular work into
heat. But the reverse is not always possible; we cannot always turn
heat into work.
When heat Q flows from a body at temperature T1 to a body at temperature T2, the total entropy change is Q/T2 − Q/T1. The second law tells us that this must be positive; total entropy cannot decrease. Thus, the first term must be greater than the second, which means T2 < T1. The heat always flows from the higher temperature (T1) to the lower one (T2).
Returning to the steam engine, heat flows from the boiler; thus, the boiler's entropy decreases. The engine produces some work, but work doesn't affect the entropy. To satisfy the second law, therefore, the engine must expel some waste heat to raise the entropy of the surroundings. That's why it cannot be 100% efficient.
Defining Entropy
In fact, for almost any macroscopic system, we know only a few bits of
information, such as the mass, volume, temperature, and so on. That
small amount of large-scale information is called the macrostate of the
system. We do not know the microstatethe exact situation of every
individual atom.
There are many, many possible microstates that are consistent with the macrostate. Let's call that number of possible microstates M, a mind-bogglingly huge number. Because there are M possible microstates, and we don't know which one we have, we are missing log2(M) bits of microstate information. That's the Shannon entropy, H, of the microstate message. That's how many binary digits we would need to record all the details about all the atoms in the system.
Suppose we take a liter of gas and expand it to twice the volume, 2 liters,
while keeping it at the same temperature. Clausius would calculate its
change in entropy as follows:
We could make the change by slowly expanding the gas from 1 liter
to 2 liters. Ordinarily, a gas would cool as it expands, so we should
add some heat to maintain the same temperature. Adding heat adds
entropy. The change in entropy is just Q/T.
That is exactly the right answer, the same answer that we found by
keeping track of the heat added as the gas expands. But thinking
about information makes the meaning obvious. If we double the
volume, each air molecule has twice as many places it can be. That
means we lack 1 additional bit of microstate information for each
molecule.
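We can check that the Clausius bookkeeping and the information bookkeeping agree. For an ideal gas doubling its volume at constant temperature, the heat added is Q = N·k·T·ln 2, so Q/T = N·k·ln 2: exactly k·ln 2, one bit's worth of entropy, per molecule. A minimal Python sketch, assuming a liter of ideal gas at room temperature and atmospheric pressure:

    import math

    k_B = 1.380649e-23          # Boltzmann constant, J/K
    T = 300.0                   # temperature, K (assumed)
    P = 101325.0                # pressure, Pa (assumed: 1 atm)
    V = 1e-3                    # initial volume: 1 liter, in m^3

    N = P * V / (k_B * T)                   # number of molecules (ideal gas law)
    Q = N * k_B * T * math.log(2)           # heat added in an isothermal doubling
    delta_S_clausius = Q / T                # Clausius: change in entropy = Q/T
    delta_S_info = N * k_B * math.log(2)    # 1 extra bit per molecule, k ln 2 each

    print(N)                                 # ~2.4e22 molecules
    print(delta_S_clausius, delta_S_info)    # identical: ~0.23 J/K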
You will notice that the Boltzmann formula is related to the Hartley-
Shannon formula for information entropythe logarithm of the number
of possibilities. This assumes, as Boltzmann did, that all the possible
microstates are equally likely, but there are situations in which some
microstates are more probable than others. For instance, because
of gravity, one nitrogen molecule in a room is slightly more likely to
be near the floor than near the ceiling. Thus, not all microstates are
equally likely.
S = k2 Σm p(m) log2(1/p(m)).
Over time, the microstate of a system changes. For instance, in a jar of air,
the molecules move and collide with each other. The macrostate might
not change, but the microstate continually changes in complicated ways.
And this is true whether we think of the gas molecules as little balls that fly around and exert forces on each other, acting according to Newton's laws of motion, or as quantum mechanical particles that propagate and
interact as waves, according to quantum laws. It is also possible that
changes in the microstate lead to changes in the macrostate, but the
laws that govern those changes are still the microstate laws.
However, there is a remarkable and important fact about the laws that
govern how microstates change, known as the principle of microstate
information: As microstates change, microstate information is preserved.
As weve said, information has to do with the distinction between
things. Imagine two microstates that differ, if only in the location of
one particle. They represent two possible ways the system could be
at a given moment.
Imagine now that a system starts out in one macrostate and winds up
in another. For instance, we might start out with separate oxygen and
nitrogen gases, but then we allow them to become mixed. In phase
space, the system starts out in one region, then ends up in another.
Whatever happens does so because of the laws governing the
microstates.
All the different microstates in the initial region end up as different
microstates in the final region. Therefore, the final macrostate (the region at the end of the process) must be at least as large as the
initial macrostate, and it might be larger. The points in Delaware
could all end up in Texas, but Texas could not t into Delaware.
Maxwell's Demon
To illustrate this, Maxwell devised a famous thought experiment: he imagined a tiny being that came to be called Maxwell's demon. The demon is so small that it can perceive and act on individual molecules.
[Portrait: James Clerk Maxwell]
After a long while, the demon's game will sort the molecules
out: All the oxygen will be on one side, and all the nitrogen on
the other. The gas goes from mixed to separate; the entropy
decreases. In other words, the action of the demon can contradict
the second law.
The demon can violate the second law in other ways by playing the
game differently. If it lets every type of molecule through in one
direction but not the other, it can eventually cause the whole gas to
occupy a smaller volume. If the demon sorts the molecules by their
speed, so that faster molecules end up on one side and slower ones on
the other, it can produce a temperature difference where none existed
before. In any of these ways, the demon can engineer a decrease in the
entropy of the gas.
Was Maxwell right? Is the second law really about us and our limitations,
rather than about nature? For 100 years, one physicist after another
tackled this puzzle. Most of them believed that Maxwell had left
something out of his thought experiment. Perhaps if we analyzed
carefully how the demon might actually work, then we would discover
that it could not reduce entropy, and the second law would be valid after
all. Yet despite repeated analysis, Maxwell's demon proved difficult to exorcise in a definitive way.
TERMS
READING
2 One gram of water contains about 3.3 × 10²² water molecules. It takes about 2000 J of energy to boil a gram of water at 373 K (100°C). How
many bits of information are lost per water molecule when it boils?
The second law of thermodynamics that Clausius discovered
a century and a half ago turns out to be exactly equivalent to
the following statement: No process can have as its only result
the erasure of information. In this lecture, we'll explore why the
erasure of information is so important and what it has to do with the
second law.
[Diagram: a book slid to the left (L) or to the right (R) ends up in the same state of rest (0).]
We can learn many things from this. In the microscopic world of individual
atoms and molecules, there is no friction. Friction is a big-picture thing,
a phenomenon of systems with trillions of molecules. Friction can cause
the loss of macrostate information (the distinction between a book moving to the left or to the right), and it can produce an increase in
entropy. As we will see, those two effects are inextricably linked.
The Hungarian Leo Szilard and, later on, the French Nobel laureate
Leon Brillouin analyzed the demon as a system that acquires and uses
information. Starting from quantum physics, they argued that the demon
must expend energy to acquire its information about the gas molecules.
Perhaps it has to bounce photons off of them, which takes energy. The
energy is dissipated as heat, leading to an increase in the entropy of
the surroundings. This offsets the decrease in entropy achieved by the
demon and, overall, vindicates the second law.
Landauer's Principle
One of the people who read Brillouins work was a physicist at IBM
named Rolf Landauer. He wondered: What do the fundamental laws of
physics tell us about systems that process information? What principles
govern the physics of computers?
Landauer's own research was about how electrons move inside materials: the basic physics, in other words, of modern electronic
computers. But his question was wider.
Landauer found that even if you design a computer perfectly and are
content to run it very slowly, there would still be some waste heat.
Entropy would be increased. The process would be thermodynamically
irreversible, and the reason for that is that computation is logically
irreversible.
Suppose, for example, that the computer adds two numbers and keeps only the sum, 7. From the final memory state, 7, we cannot deduce the initial memory state. It might have been 3 and 4, or it might have been 5 and 2. The computer has forgotten exactly how it started out.
[Diagram: erasing a memory bit sends both distinct initial states, 0 and 1, to the same reset state.]
Note that the memory bit (the physical thing that stores the 0 or 1) might be very tiny. It might even be a single atom. We've referred to 0 and 1 as distinct macrostates, but the physical difference between them might not be very macro. That doesn't matter; the important thing is that they are distinct.
To capture this idea, we'll use the word eidostate (from the Greek eidos, to see). The eidostate encompasses all the information that is available to the computer: all of the data that the computer can currently see and use. That includes macroscopic information, such as its temperature or battery level, and it includes the 0s and 1s in memory.
It is exactly the same situation as when we slid the book across the table. Because the system forgets where it started out, the final eidostate must be larger. It must have a greater entropy. We can calculate the change in entropy as follows: ΔS = Sfinal − Sinitial. Sinitial is k2 log2(M), while Sfinal is at least k2 log2(2M), because the final region must hold the images of both initial bit values. Therefore, ΔS is at least k2 log2(2M) − k2 log2(M) = k2: erasing one bit increases entropy by at least one bit's worth.
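In conventional units, k2 is Boltzmann's constant times ln 2, so erasing one bit costs at least kB·ln 2 of entropy and, at temperature T, at least kB·T·ln 2 of waste heat. A quick Python check of the size of Landauer's price at an assumed room temperature of 300 K:

    import math

    k_B = 1.380649e-23            # Boltzmann constant, J/K
    T = 300.0                     # room temperature, K (assumed)

    print(k_B * math.log(2))      # ~9.6e-24 J/K of entropy per erased bit
    print(k_B * T * math.log(2))  # ~2.9e-21 J of waste heat per erased bit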
Reversible Computation
Landauer proposed his principle back in 1961. At the time, he did not
fully understand which computer operations had to produce waste heat.
Years later, another IBM scientist, Charles Bennett, took the next step.
He realized that every computer operation could, in principle, be made
logically reversible. A computer could operate, in principle, with no
waste heat at all. Erasure produces entropy, but computation does not
require erasure.
One of the basic logic gates is reversible: the NOT gate. It has one
input and one output. If the input is 0, the output is 1 and vice versa.
That's a reversible operation; just apply a second
NOT gate, and the original bit value is perfectly
restored.
If we want to do a reversible computation, we need
to use reversible logic gates. These are designed
by the rules of reversible information processing.
The first of these rules is that the logic gate must have as many bits of
output as there are bits of input. If there is 1 input bit, there is 1 output
bit, like the NOT gate. But if a gate has 2 input bits, it must also have 2
output bits. If it has 3 input bits, it must have 3 output bits.
The second rule is that the logic gate must preserve the input
information, and two distinct inputs must lead to two distinct outputs.
Lets illustrate this rule by describing a useful 2-bit reversible logic gate:
the controlled-NOT gate, or CNOT.
The 2 bits play different roles. One is the control bit, c. Nothing
happens to the control bit; it stays unchanged in the output. The
second bit is called the target bit, t. If the control bit is 0, then the target bit passes through unchanged; if the control bit is 1, the target bit is flipped. The output target is therefore c ⊕ t (c XOR t).
[Circuit diagram: CNOT gate with inputs c and t and outputs c and c ⊕ t.]
We can see that the gate has the right number of output bits; thus,
the first rule is fulfilled. To verify the second rule, we can make a
table of input and output values.
c   t   |   c′ = c   t′ = c ⊕ t
0   0   |   0        0
0   1   |   0        1
1   0   |   1        1
1   1   |   1        0
If the target bit starts out as 0 (a fixed constant) and the control bit is 0, both bits end up 0. If the control bit is 1, both bits end up 1. Either way, we end up with two copies of the control bit. This is how we can copy information in a reversible way. We need to start out with a target bit of 0, some blank memory in which to create the copy. Then, we use the CNOT operation.
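A tiny Python model of the CNOT gate makes both properties easy to verify: the gate is its own inverse, and with the target prepared as 0 it copies the control bit.

    def cnot(c, t):
        """Controlled-NOT: the control bit c passes through; the target becomes c XOR t."""
        return c, c ^ t

    # Reversibility: applying CNOT twice restores any input.
    for c in (0, 1):
        for t in (0, 1):
            assert cnot(*cnot(c, t)) == (c, t)

    # Copying: with the target prepared as 0, both outputs equal the control bit.
    print(cnot(0, 0))   # (0, 0)
    print(cnot(1, 0))   # (1, 1)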
Can such reversible gates actually be built? Tommaso Toffoli, along with Edward Fredkin of MIT, proposed an idealized model. It is less of a practical
blueprint than a proof-of-concept. The idea is to do our computing with
billiard balls.
The balls move along various possible paths, corresponding to the bits
in our diagram. If there is no ball on one path, that's a 0. If there is a ball, that's a 1. The balls bounce off of fixed obstacles to change direction, and
logic operations are performed by precisely timed collisions between
the balls. Because there is no friction and the collisions are perfectly
elastic, everything is reversible. Send the balls backward on their paths,
and the computation would run backward. There is no waste heat.
The billiard ball idea sounds more like a vast pinball machine than a
practical computer, but it illustrates a point. A real physical computation
is actually performed by the laws of physics (in the billiard-ball computer, according to Newton's laws of motion). And that, in the deepest sense, is
what a computer is: a piece of nature that we have arranged so that its
natural evolution encodes some mathematical calculation.
The Toffoli gate achieves its reversibility by keeping around some extra bits at the output end; this is no accident. Bennett's reversible computer does not have to generate waste heat, but it does produce waste information: extra bits stored in memory that are not part of the final answer and about which we may not care. If we want to eliminate them, we will have to erase them, and that means we will have to pay Landauer's price in entropy.
This all started from thinking about Maxwell's demon. Does the demon provide a loophole in the second law of thermodynamics? Bennett thought about Maxwell's demon as a reversible computer, making it
reversible to avoid unnecessary waste heat. As it operates, it acquires
information about the molecules in a gas and uses that information to
sort the molecules out. The demon does not produce any waste heat,
and the entropy of the gas appears to decrease. The second law seems
to be contradicted.
Szilard and Brillouin were wrong. The problem with the demon is not
the cost of acquiring information. A cleverly designed demon can do
that without generating any entropy. The problem is when the demon
eliminates information. Because of Landauer's principle, that cannot be
done for free.
The Landauer form is truly equivalent to the Clausius form of the second
law. Each one implies the other. Put another way, any machine that can
violate one form of the law could be used to violate the other form.
Imagine something that violates the Landauer form of the law. There
is some way to erase information without cost. We turn Bennett's reversible Maxwell's demon loose on a sample of gas. It drives the gas entropy down while filling up its memory with useless data. Now
the demon exploits our hypothetical free-erasing process to empty
its memory; it returns to its initial state. The only net result is that the
entropy of the gas has decreased, a clear violation of the Clausius
form of the second law.
The waste heat from that process would be returned to the body
from which the heat was originally extracted. The net result would
be that the body had returned to its original state, and nothing else
had changed except that we erased some bits.
Just to the right of the barrier is another laser; let's call it yellow. This is the demon laser. The laser, in a sense, finds out where the atoms are and manipulates their states accordingly. When a green atom passes
through this beam, it absorbs and emits photons, then turns into a
red atom. Therefore, any atom that crosses the middle barrier is soon
switched to red and stays on the right side afterward. Eventually, the
whole cloud is on the right.
Raizen's experiment does not violate the second law; the photons carry away entropy from the color-switching process. But it does provide a way of compressing the cloud of atoms without heating it up, and that's
new. It may have applications in everything from medical imaging to
isotope separation.
eidostate: A neologism (based on the Greek word meaning to see) for the
state of a thermodynamic system, including all of the information available to Maxwell's demon, whether it is macroscopic or microscopic.
READINGS
QUESTIONS
Horse Races and Stock Markets
John Kelly, like so many pioneers of the science of information,
was an engineer at Bell Labs in the 1950s, around the same time
that Claude Shannon worked there. Kelly, however, taught us to
look at information theory in a new and entirely different way. For
Shannon, information theory was about codes and messages. For Kelly, it
was all about gambling. The relation he found between Shannons theory
and making bets opens one of the most controversial chapters in the
science of information. As we will see, the same concepts developed to
describe communication also have a financial side, with applications from
Las Vegas to Wall Street.
Understanding Odds
Let's start with a binary horse race. Two horses, X and Y, are at the starting gate. We want to bet on the outcome of the race, but first, we need to know the odds.
The idea of odds can be confusing. Odds are used for different things
and expressed in different ways. What are the odds for horse X to win?
That question might mean two different things.
It might be asking the likelihood that horse X will win, in which case,
the odds are statistical or probability odds.
Or it might be asking what the payoff will be if we bet on horse X; these are the bookmaker's payoff odds. Using the British system, we say that horse X has 4 to 1 odds. This means that on a $1 bet, if X wins, we are paid $4, but if X loses, we must pay $1.
For our purposes, we will assume ideal odds, and to keep things simple,
we will suppose that the odds are even in our binary horse race. The
bookmaker thinks that each horse is equally likely to win; q( X ) and q(Y )
are both 1/2, and each has decimal odds of 2.0. A $1 bet on either horse,
if successful, yields $2.
Our own probabilities for the horses, p(X) and p(Y), give our entropy, the average surprise:

H = p(X) log2(1/p(X)) + p(Y) log2(1/p(Y)).
Before we move on, note this fact: From the bookmaker's probabilities, q(X) and q(Y), we can calculate the bookmaker's surprise for each of the horses.
Surprise is log2 of the reciprocal of the probability, or log2(1/q( X ))
and log2(1/q(Y )).
If we average the bookmaker's surprise using our own probabilities, we get p(X) log2(1/q(X)) + p(Y) log2(1/q(Y)), a quantity that can never be smaller than our own entropy H.
John Kelly imagined a series of identical races, one after the other. In
each, the bookmaker's decimal odds are 2.0 on each horse. For each
race, we get a tip that leads us to adjust our probabilities. Without the
tips, we would not expect to win any money in the long run, but with the
tips, we can hope to win more often than we lose and come out ahead.
Kelly asked two fundamental questions: (1) What is the best betting
strategy in the long run? (2) What payoff can we expect, in the long run,
from that strategy?
A        G·A      G²·A     G³·A     G⁴·A
$100     $125     $156     $195     $244
Maximizing Investment
With this understanding, we can answer such questions as: Which would
be better in the long run, an investment of $10 that grows at 10% per year
or an investment of $1 that grows at 20% per year? The growth factors in
the two cases are 1.10 and 1.20. We can use those to calculate the overall
growth factors for years into the future:
After 30 years, the 20% growth factor is more than 10 times the
other factor; thus, that investment is now worth more in absolute
terms, even though it started out much smaller.
In the long run, the larger growth rate always wins, no matter
how things start out. But that is a statement about the long run.
Practically speaking, 30 years might be too long to wait!
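As a quick check of these figures, here is a small Python calculation (added for illustration; it uses only the growth factors quoted above):

# Compare $10 growing at 10% per year with $1 growing at 20% per year.
slow = 10 * 1.10 ** 30
fast = 1 * 1.20 ** 30
print(1.20 ** 30 / 1.10 ** 30)   # about 13.6: the 20% factor is more than 10 times larger
print(slow, fast)                # about $174 versus about $237 after 30 years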
A        G1·A        G2·G1·A        G3·G2·G1·A
Let's consider an easy example. In year 1, our growth rate is −50%; we lose half our money. The
growth factor G1 is 0.5. In year 2, our growth rate is 100%. The growth
factor G2 is 2.00. Overall, then, we broke even: G2G1 = 1.0. What is the
equivalent annual growth rate? What steady growth factor G for two
years would yield the same net return? The answer is obvious, but it's
not the average of the Gs.
If we average 0.5 and 2.0, we get 1.25, which is too large. What we
really need to calculate is the geometric mean. For two numbers,
that's the square root of their product: G = (G2G1)^(1/2). In our example,
that formula gives 1.0, which is the correct annual growth factor.
The same equation looks quite different if we take its logarithm. But first,
remember two logarithm facts:
If we multiply numbers together, their logarithms add up: log2(xy) =
log2( x) + log2( y ). The product of growth factors turns into the sum of
their logarithms.
Even if we think that horse X is more likely to win, we must not bet all
our money on X because X might lose. In the long run, X is sure to lose
eventually. When that happens, the growth factor for that race is 0. Thus,
the overall growth factor, the product of the factors for all the races, is
also 0. That's called the gambler's ruin, and we need to avoid it.
One way to avoid the gambler's ruin is to hold back some money and not bet it; another
way is to hedge our bets, betting some money on each horse. Because
one of them is sure to win, we can never get completely wiped out. It
turns out that since our bookmaker is offering ideal odds, the two ways
are exactly equivalent. There is no real difference. We will assume that
we will always bet all of our money, just dividing it somehow between
the horses.
Suppose horse X wins. We bet a fraction b(X) of our money on that horse and won back $2 for
every $1 we bet. The money we bet on horse Y is lost. Therefore, our
growth factor when X wins is 2b( X ). When Y wins, the growth factor
is 2b(Y ).
Also, we can separate each logarithm into the sum of two logarithms:
log2(2) = 1, and p(X) + p(Y) = 1. Therefore, log2(G) is just 1 plus the
sum of the p's multiplied by the logs of the b's. We can rewrite this in
a clever way, remembering that log2(b) is just −log2(1/b):

$$\log_2(G) = 1 - \left[\, p(X)\log_2\frac{1}{b(X)} + p(Y)\log_2\frac{1}{b(Y)} \right].$$

By the information inequality, the bracketed average is smallest when our betting fractions match our probabilities: b(X) = p(X) and b(Y) = p(Y). With that choice, log2(G) = 1 − H. Because the tip reduces our uncertainty about the race from 1 bit to H, the side information in the tip is I = 1 − H, so Kelly's answer is simply log2(G) = I.
Suppose the tip gives us no information. In that case, I = 0. After the tip,
we still have an entropy H = 1 bit. We are as uncertain about the winner
as we were before. The best we can do is to divide our bets equally
between the horses. Because we always win one of the bets, we get
our money back, but no more than that. The growth factor G = 1, so
log2(G) = 0, just as Kelly's formula tells us.
If the tip tells us everything about which horse will win, I = 1 bit, and
H = 0. We should, therefore, bet everything on the sure winner. We will
double our money each time: G = 2, which means log2(G) = 1. That also
agrees with Kelly's formula.
Now suppose that our tip tells us only a little. After we get it, we estimate
that one horse has a 75% chance of winning and the other, 25%. That's
how we'll allocate our bets. The new probabilities yield an entropy
H = 0.81 bits, which means that the tip gave us only 0.19 bits of side
information; that doesn't sound like much. According to Kelly, we can
make this equal to log2(G), so the gain factor G = 2^0.19, or 1.14. In the long
run, we expect to increase our wealth about 14% per horse race; even a
pretty poor tip can help!
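The 14% figure is easy to check numerically. The sketch below is my own illustration, not part of the lecture: it assumes the tipped horse really does win 75% of the time, that we bet the fractions 0.75 and 0.25, and that the decimal odds are 2.0.

import math
import random

def simulate(races=10_000, seed=1):
    rng = random.Random(seed)
    wealth = 1.0
    for _ in range(races):
        tipped_horse_wins = rng.random() < 0.75      # the tip is right 75% of the time
        winning_fraction = 0.75 if tipped_horse_wins else 0.25
        wealth *= 2.0 * winning_fraction             # decimal odds of 2.0 on the winner
    return wealth ** (1 / races)                     # average growth factor per race

H = -(0.75 * math.log2(0.75) + 0.25 * math.log2(0.25))   # about 0.81 bits
print("Kelly's prediction:", 2 ** (1 - H))                # about 1.14
print("Simulated growth:  ", simulate())                  # close to 1.14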
Kelly also considered the situation where the bookmaker does not offer
ideal odds. He reduces them a little to give himself an edge. Our best
strategy is now more complicated. It might be that some part of our
money should be kept in cash, rather than wagered. Also, we may not
wish to bet on every horse. If the odds are bad enough, it may even be
better not to bet at all. In any case, Kelly's information-theoretic analysis
tells us how we should bet to maximize log2(G), and it tells us how much
each bit of side information is worth.
One critic of the log-optimal Kelly strategy was Paul Samuelson of MIT,
probably the most accomplished and influential mathematical economist
of his day. Samuelson did not disagree with Kelly's mathematics; he just
disagreed that the math expressed a realistic view of the relationship
between people and money.
First, Kelly's strategy is only optimal in the long run, but that might be
very long indeed. For any given finite amount of time, the log-optimal
strategy exposes the investor to the possibility of very large losses. To
maximize the gain factor, the strategy accepts a great deal of risk.
The point about risk brings up an even more basic point. According
to Samuelson, the log-optimal strategy ignored the way that people
actually value money. If you have $100, then $10 extra seems like a lot.
But if you have $100,000, that same $10 is not such a big deal. Thus, the
relation between money and value can be complicated. What we really care about, Samuelson argued, is not wealth itself but the value it has for us.
TERMS
geometric mean: The square root of the product of two numbers or, for
N numbers, the Nth root of their product. The average growth rate of an
investment or debt is just the geometric mean of the annual growth rates
over its lifetime.
QUESTIONS
In the 1960s, an entirely new theory of information was discovered
independently by the most famous mathematician in the world and by a
independently by the most famous mathematician in the world and by a
teenaged undergraduate at the City College of New York. The famous
mathematician was the Russian Andrey Nikolaevich Kolmogorov, and
the undergraduate was Gregory Chaitin, who would later become an
eminent computer scientist and researcher at IBM. While Shannon had
based his theory on communication, in which a message is encoded
and sent to a receiver, Kolmogorov and Chaitin based their new theory
on computation, in which a program is executed and produces some
output. The theory came to be called algorithmic information theory.
From this point on, we will assume that we've agreed on a particular
universal computing machine. It might be a Turing machine, or it might
be some other kind of computer. We will designate it UM for universal
machine.
For convenience, we will allow our UM a few extra features. These do not
change anything that the UM can do, but they make it easier to discuss
how it operates. We will assume that the input to the UM, the program
and any initial data, is provided on a separate tape that the machine reads
from left to right. The output of the UM will be printed on its own tape.
Everything else occurs inside the machine, on an internal work tape.
Finally, when we discuss a program for the UM, we won't try to describe
it by 0s and 1s. Instead, we'll give a heuristic, high-level description that
we can understand. Computer programmers call this "pseudocode":
not an actual program but a careful outline from which a program could
be written.
Decoding as Computation
Suppose that our message is itself a binary string, like the one shown
below. Actually, s1 is 1 million bits long, but the rest are just like this
string. Our job is to describe the string as briefly as possible.
s1 = 01010101010101010101010101...
If the receiver is a very simple machine, then we'll have to send the bits
one by one, but that seems very inefficient. Intuitively, there is just not
that much information in the string. If the receiver is a computer like the
UM, we can instead send a short program that tells it to print "01" half a million times and then halt.
Note that this takes advantage of the pattern in the string. The result is a
program that is much less than 1 million bits long.
s2 = 100011100001010011111001...

s3 = 110010010001111110110101...

String s2 has no apparent pattern, so we may have no better option than to send all of its bits. String s3 also looks patternless, but it is actually the binary expansion of π, which a short program can compute from a series such as:

$$\pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right).$$
Algorithmic Entropy
We have our UM, whose programs and output are binary strings. If the
program p produces the output s, then halts, we can symbolize that with
an arrow: p → s. Not every program produces an output because not
every program actually halts; some get stuck in an endless loop. But if
p → s, then p is a description for s as far as our UM is concerned.
Any binary string s (program or output) has a length L(s), the number
of bits in s. For example, for our binary string s1, L(s1) was 1 million. In
contrast, the simple program that generated it, p1, was shorter; L(p1)
was much less than 1 million.
For any string, there are many different programs that produce it as
output. Thus, for string s, we'll let P(s) stand for the set of all programs p
for which p → s. Some programs in P(s) may be quite long and inefficient;
however, there must be a program in P(s) with the fewest bits. That
program is the shortest description of s. We'll call it s*.
The length of that shortest description, K(s) = L(s*), is the algorithmic entropy of s.
The two entropies are related to each other. For a source that produces
binary sequences, the Shannon entropy is approximately the average
of the algorithmic entropy, taking an average over all the possible
sequences that the source might produce: H ≈ ave(K). Shannon entropy
is a way to estimate the algorithmic entropy, on average.
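The true K(s) is defined by the best possible description, which in general we cannot find. Still, an ordinary compression program gives a rough, computable stand-in for the same idea, as in this sketch (an illustration only; zlib is not the shortest-program length):

import os
import zlib

patterned = b"01" * 500_000              # like s1: a million symbols with an obvious pattern
random_looking = os.urandom(1_000_000)   # typical "random" data, incompressible with high probability

print(len(zlib.compress(patterned)))        # a few thousand bytes
print(len(zlib.compress(random_looking)))   # roughly a million bytes; no real compression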
Defining Randomness
Let's play this game a little differently. The monkey will be a programmer
for our universal computer. Its keyboard will just have two keys, 0 and 1. It
hits them with equal probability, 1/2. Whatever the monkey types is given
to the UM as a program. We run it and see what the computer does.
Note that we have assumed the UM reads its input tape, where the
program is, from left to right. Thus, if the first 100 bits of the tape form
a valid program, none of the rest of the bits will even be read by the
computer. If p is a valid program for the UM, then p followed by some
additional bits, such as p + 0110100, is really the same program. Only
the first L(p) bits of the input tape matter.
What is the probability that the monkey programmer will write program
p? Each bit in p has probability 1/2; thus, the overall probability is (1/2)^L(p),
or 2^−L(p). That's the monkey probability of program p.
The monkey is sure to type some kind of program, and the computer will
do something. Thus, the sum of all the monkey probabilities is 1. We can
write that fact using the sigma notation for sums:
$$\sum_p 2^{-L(p)} = 1.$$
By far the most probable way for the monkey to produce s is for it to
type program s*. This means that it is not all that unlikely for the monkey
to produce a string of a million 1s. The shortest program that produces
that string is very short. We can just barely imagine a monkey typing that
program by accident. The total monkey probability M(s) for string s is,
therefore, about 2^−L(s*), which is 2^−K(s).
We take the logarithm of this equation and solve for the algorithmic
entropy K(s):
$$K(s) \approx \log_2\frac{1}{M(s)}.$$
Artificial Intelligence
This is much more serious than it sounds. In fact, it can shed new light
on a centuries-old philosophical principle, Ockham's razor, which states
that when we are faced with a choice of hypotheses, we should choose
the simplest one. The simpler theory is not necessarily the true one, but
it is more likely to be true.
Any theory p is a program within P(s). Ockhams razor advises that the
computer should choose the shortest one, which is s*. As we saw, that
one is more likely in monkey probability! If a monkey were to generate
theories at random, and it happened to produce a theory that matched
the data s, then it would be most likely to get there by typing s*. That's
the sense in which s* is the most likely theory.
Of course, if there are two or more theories that have exactly the same
number of bits, they will be equally likely in Solomonoff's monkey
version of Ockham's razor. The computer will have no basis to choose
between them. But as we saw, even a few bits one way or the other
can make a significant difference in probability. The monkey odds are
strongly in favor of the shortest description, s*.
That's the question that caught the interest of Wojciech Zurek of Los
Alamos National Laboratory. He reasoned as follows: When the demon
acquires its information, it reduces entropy elsewhere. When the demon
erases its information, it increases entropy elsewhere. The demon's
information must be a form of entropy! Yet the memory record is some
particular string of bits. It is not missing microstate information, like the
ordinary entropy. It's not missing at all. The demon knows it. It's part of
what the demon sees, part of its eidostate. How can that have entropy?
Zurek showed that the total entropy, including the algorithmic part,
never goes down on average at any stage. As the demon acquires and
uses information, ordinary entropy decreases, but K increases. As the
demon erases its data, K decreases, but ordinary entropy increases. The
second law holds true throughout.
TERMS
READINGS
QUESTIONS
3 What does the word "random" mean? What does it refer to? How is this
related to data compression?
Uncomputable Functions and Incompleteness 20
Shannon's information is based on codes and communication.
Algorithmic information is based on descriptions and
computation. This new way of thinking about information brings
with it new insights on many things: the meaning of randomness,
Ockham's razor for choosing between hypotheses, and even Maxwell's
demon and the second law of thermodynamics. Yet algorithmic
information is beset by a strange impossibility, one that connects this new
concept of information to the very foundations of logic and mathematics.
We'll examine that impossibility in this lecture.
It turns out that most strings of a particular length are random. Consider
strings that are 100 bits long. There are 2^100 of them, which is a fairly
large number. We'll say that s is random if K(s) is at least 90, that is, s* is
at least 90 bits long.
How many non-random 100-bit strings can there be? Each of these is
describable by a program less than 90 bits long, which means that the
number of such strings is limited by the number of such programs. That
number of programs is less than 2^90, which is also a large number, but
it's much smaller than 2^100. In fact, their ratio is about 0.001:

$$\frac{2^{90}}{2^{100}} = 2^{-10} = \frac{1}{1024} \approx 0.001.$$
At least 99.9% of strings with 100 bits are random by this definition. In
other words, almost all strings are algorithmically random. The ones with
patterns are the rare exceptions.
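A two-line check of the counting argument (using only the numbers quoted above):

non_random_bound = 2 ** 90    # at most this many strings have descriptions shorter than 90 bits
total = 2 ** 100              # number of 100-bit strings
print(non_random_bound / total)       # 2**-10, about 0.001
print(1 - non_random_bound / total)   # at least 99.9% of 100-bit strings are random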
Suppose the shortest description s* could itself be compressed, with its own much shorter description s**. Then consider the following program:

Program r:
1. Run s**.
2. Run the output of step 1 as a program and
print its output.
3. Halt.
Program r is really just s** plus a few extra instructions; thus, the
length of program r is about equal to the length of s**, which is, by
our assumption, much shorter than s*. But program r produces s as its output, so it would be a shorter description of s than s* itself, which is impossible. The shortest description s* cannot be compressed any further; it is itself random.
Program f: (input x)
1. Calculate x^2 and print it.
2. Halt.
Not all computer programs ever come to a stopping point; some get
stuck in an infinite loop. And some functions can be like that, too. A
function f might calculate output and halt for some inputs but get stuck
for others. For computer programmers, it would be handy to know in
advance whether or not a given program is one of those that will run
forever or will eventually finish. For the UM, we would like to know
whether program (f,s), function f acting on input s, ever halts.
As a logical fact, every program (f,s) either will or will not halt. The answer
is 1 bit of information: yes or no. Thus, for every (f,s), there is either a 1 or
a 0, depending on whether it halts or not. That's the halting function, and
it certainly does exist mathematically. It's a function we would like to be
able to compute.
Program t: (input x)
1. Compute h(x,x).
2. If h(x,x) = 0, halt. If h(x,x) = 1, loop
forever.
In short, the program (t,f ) always does the opposite of what (f,f )
would do. If one halts, the other does not and vice versa.
According to Turing, the only place we could have gone wrong is when
we assumed that we had a program h to compute the halting function;
therefore, no such program can exist. The halting function, which is
well-defined mathematically, is not computable by any program.
If we tried to write a program to tell whether another program
halts or not, no matter how our program worked, it would have
to fail sometimes. Either it would make mistakes by giving the
wrong answer, or it would sometimes get stuck and be unable
to decide.
As Turing did for the halting function, we will assume for the sake of
argument that there is a program k that calculates K(s). If we provide our
UM with program k followed by a string s, that is, the complete program
(k,s), then it will compute K(s), print that out, and halt.
Program k itself has some length, L(k). It might be quite a long program,
perhaps 100 million bits long. But however long it is, we could come
up with a number N that is more than 1 billion times larger than L(k):
N > 10^9 × L(k).
Next, we must note that it is relatively easy to generate all binary strings,
one by one, starting from the shortest. That is, we start with 0, then 1;
then 00, 01, 10, 11; then 000, 001, and so on. First we get all the 1-bit
strings, then all the 2-bit strings, then all the 3-bit strings. This procedure
is not very complicated, and if we keep it up, eventually, we hit every
binary string of every length.
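That enumeration is simple enough to write down directly. Here is a minimal Python sketch of it (an illustration only, not part of program b itself):

from itertools import count, product

def all_binary_strings():
    # Yield 0, 1, 00, 01, 10, 11, 000, 001, ... : every binary string, shortest first.
    for length in count(1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

strings = all_binary_strings()
print([next(strings) for _ in range(10)])   # ['0', '1', '00', '01', '10', '11', '000', ...]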
Program b:
1. Generate the next string s.
2. Compute K(s) using k.
3. If K(s) < N, go back to step 1. Otherwise, print
s and halt.

Program b eventually prints some string whose algorithmic entropy is at least N bits. But program b itself is little more than program k plus the number N, so its length is far less than N. That makes b a short description of a string that, by assumption, has no short description. The contradiction means that program k cannot exist: the algorithmic entropy K(s) is uncomputable.
Implications of Uncomputability
Why does this matter? Suppose Maxwell's demon is at the stage where
it wants to erase its memory record, m. Every bit that it erases will
cost energy and produce an entropy of k₂ in the environment. That's
Landauer's principle.
Naturally, the demon does not wish to use any more energy than
necessary. Thus, before it erases its memory, the demon will do
data compression on it, so that it erases a smaller number of bits.
This must be a reversible computation, of course. The demon
wants to find the smallest bit-string from which the original memory
m can be reconstructed. And, of course, that's just m*, the shortest
description of m on a universal computer. That m* is K(m) bits long.
Some programs that the monkey types will eventually halt. Others
will run forever. The number Ω is the total monkey probability that
the program halts. In symbols, we just add up all the probabilities
for the halting programs:

$$\Omega = \sum_{p\ \mathrm{halts}} 2^{-L(p)}.$$

If we add up the probabilities only for those programs that we have so far identified as halting, we get a partial sum that falls short of the total:

$$\sum_{p\ \mathrm{known\ to\ halt}} 2^{-L(p)} < \Omega.$$

Given that there are some other halting programs that we haven't
identified yet, we know that the partial sum is a little less than Ω. The problem is
that we never can know exactly how close the partial sum is to Ω.
TERM
QUESTIONS
Qubits and Quantum Information 21
In the last years of the 19th century, the German physicist Max Planck
was concerned with entropy. The entropy formulas of Boltzmann
was concerned with entropy. The entropy formulas of Boltzmann
and Gibbs worked beautifully for the thermodynamics of gases,
solids, and liquids. But when Planck tried to use the same ideas
for radiation (light), his calculations went badly awry. What Planck
eventually discovered in 1900 was something quite unexpected. The
entropy formulas were correct; what was wrong was our understanding
of the microscopic world. Planck's discovery was the beginning of the
greatest revolution in the history of physics: the quantum revolution.
This revolution is still in process, and right now, some of its most exciting
events are taking place in the science of information.
What Planck found is that the microscopic world is far more digital than
we imagined. A light wave does not have a continuous range of energies
but only discrete values; it is quantized. The indivisible packets of energy
are called photons.
At the same time, many other ideas that researchers were working
on made sense as quantum information, including quantum
entanglement, quantum cryptography, and quantum computing. If we
want to build a new quantum information theory, we should start by
trying to find an equivalent to Shannon's first fundamental theorem (the
data compression theorem): The Shannon entropy equals the number
of binary digits needed to carry a message. What was needed was
quantum versions of all the pieces of Shannon's theorem.
A bit is any classical system that has two distinct states, which we can
label 0 and 1. A qubit is a quantum system with two distinct states. For
example, a photon can be horizontally polarized or vertically polarized;
thus, we can label the horizontal state 0 and the vertical state 1.
[Figure: A classical bit has just two states, 0 and 1. A qubit, such as photon polarization, has the states |0> = H (horizontal) and |1> = V (vertical), plus superposition states such as H + V (diagonal, D), H − V (anti-diagonal, A), H + iV (right circular, R), and H − iV (left circular, L).]
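For concreteness, the polarization states named in the figure can be written as ordinary 2-component vectors. The short sketch below is my own illustration (labels follow the figure, with H = |0> and V = |1>); the squared overlap between two states gives a probability.

import numpy as np

H = np.array([1, 0], dtype=complex)   # horizontal, |0>
V = np.array([0, 1], dtype=complex)   # vertical, |1>
D = (H + V) / np.sqrt(2)              # diagonal
A = (H - V) / np.sqrt(2)              # anti-diagonal
R = (H + 1j * V) / np.sqrt(2)         # right circular
L = (H - 1j * V) / np.sqrt(2)         # left circular

# The squared overlap is the probability that one state passes a filter for the other.
print(abs(np.vdot(D, A)) ** 2)        # 0.0: D and A are perfectly distinguishable
print(abs(np.vdot(H, D)) ** 2)        # 0.5: H passes a diagonal filter half the time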
But those two states are so close to each other that they cannot
be reliably distinguished by any measurement. They are not as
physically distinct as 0 and 1. This means that the Shannon entropy
is far too high.
Luckily, John von Neumann had already figured out the quantum
version of the Gibbs formula for thermodynamic entropy, and we
can adapt it as an information entropy. Von Neumann's formula
gives just 0.006 bits for our photon polarization example.
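We can see how this works with a small calculation. The sketch below is my own illustration, not the lecture's exact example: it mixes two polarization states separated by an assumed 3-degree angle, each occurring with probability 1/2, and evaluates von Neumann's entropy from the eigenvalues of the resulting density matrix.

import numpy as np

theta = np.radians(3.0)                            # assumed angle between the two states
psi0 = np.array([1.0, 0.0])                        # horizontal polarization
psi1 = np.array([np.cos(theta), np.sin(theta)])    # tilted slightly away from horizontal

rho = 0.5 * np.outer(psi0, psi0) + 0.5 * np.outer(psi1, psi1)   # density matrix of the mixture
eigenvalues = np.linalg.eigvalsh(rho)
S = -sum(p * np.log2(p) for p in eigenvalues if p > 0)
print(S)   # about 0.008 bits, far below the 1 bit of Shannon entropy for the labels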
The setup for this experiment is as follows: We have a source that sends
out photons one by one. Each passes through an opaque barrier with
two small openings. A bit further on, the photon strikes a screen that is
covered with a dense array of photon detectors, rather like the array of
detectors in the CCD chip inside a camera. The photon is recorded at
just one of those detectors.
If we open up just slit 1, then the photon might land anywhere in a
wide range of places on the screen. The same is true if we open up
only slit 2.
It is easy to think about this in the wrong way. We naturally assume that
the photon goes through one slit or the other. We may not know which
slit that is, but nature knows; the uncertainty is all in our minds. But that's
not what is happening here.
If we were to change the experiment and find out which slit the
photon passed through, perhaps by placing super-sensitive non-absorbing
photon detection devices beside the slit, then the
interference fringes disappear. The superposition is undone. The
quantum information is destroyed.
But in the mid-1990s, Peter Shor of Bell Labs and Andrew Steane of
Oxford University discovered the first quantum error-correcting codes.
By using many qubits to encode the quantum information of 1 qubit, their
codes made that qubit of information more resistant to errors.
The qubit state 0 might be represented by the codeword state 000
of 3 qubits. State 1 becomes the codeword state 111. A superposition
of 0 and 1 becomes a superposition of 000 and 111. That's not three
copies of the qubit state; it's an entangled state of 3 qubits. The
quantum information is not duplicated; it is shared by all three
codeword qubits. And that makes it resistant to error processes that
affect only 1 qubit.
This quantum code is a little too simple, but there are longer codes
involving 5 or more qubits that are entirely immune to single-qubit
errors. Any 1 qubit of the codeword could suffer decoherence or be
damaged or destroyed, and still, the original quantum information
can be reconstructed by the receiver.
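As a concrete picture of the 3-qubit codeword, here is a minimal sketch (illustration only): the encoded state a|000> + b|111> lives in an 8-dimensional space, with |000> at index 0 and |111> at index 7.

import numpy as np

a, b = 0.6, 0.8                 # example amplitudes with |a|^2 + |b|^2 = 1
codeword = np.zeros(8)
codeword[0] = a                 # amplitude of |000>
codeword[7] = b                 # amplitude of |111>
print(codeword)
# Note: this is one entangled state of 3 qubits, not three copies of a|0> + b|1>;
# an unknown qubit state cannot simply be copied.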
Consider two noisy quantum channels. The expression for the maximum
quantum information through one channel is Q. When we use the two
channels together, we actually can get more than 2Q. The reason for this
is that the channels can be used in an entangled way. The extra capacity
in the Hastings example does seem to be very small, but no one has
proven that it always must be small.
Unfortunately, we cannot read out all the answers to all those different
computations. But in some cases, we can combine the answers using quantum interference, so that a single measurement at the end reveals some collective property of all of them.
No one has yet built a quantum computer having more than a handful
of qubits because all the computer's qubits must be informationally
isolated from the surroundings. At the same time, they cannot be isolated
from each other, because they must exchange quantum information to
carry on their computation. That combination is difficult to achieve. At
the moment, all the proposals for large quantum computers look like
Babbage's Analytical Engine: impressive in conception but perhaps
beyond the abilities of their inventors to build.
READINGS
2 Why is decoherence much more rapid for large objects than for small
ones?
Quantum entanglement between particles has been
recognized and studied for a long time. The term goes
back to Schrödinger, who called it "the characteristic trait of
quantum mechanics." Albert Einstein and Niels Bohr debated
its meaning. In recent decades, we have probed the phenomenon of
entanglement with experiments of increasing delicacy and sophistication.
Quantum information theory has taught us to think of entanglement as a
kind of information. As we will see, we can measure it, transform it from
one type to another, and use it for various purposes.
Suppose Alice and Bob are in separate places. They want to produce
pairs of entangled particles, one of which is in Alice's possession and
the other in Bob's. The two do all sorts of manipulations of quantum
information at their separate locations. They send ordinary classical
messages back and forth. They are allowed to perform any local operations
and classical communication (LOCC). But try as they might, they
cannot create any entanglement between them. LOCC operations are
not enough to make entanglement.
One answer is that they can change the entanglement from one
form to another. Suppose the entangled pairs of qubits are not very
entangled. Their superpositions are 00 plus just a tiny amount of 11.
There is much less than 1 ebit of entanglement in them. By LOCC
operations, Alice and Bob can turn a large number of weakly entangled
pairs into a smaller number of strongly entangled pairs. This is called
entanglement concentration, and it is the entanglement version of
data compression.
The entangled pairs might also be noisy. When Alice sends her qubit to
Bob, it might undergo some decoherence. Too much decoherence and
the entanglement is lost. But it might be that Alice's and Bob's qubits are
still entangled but in a noisy way. In this case, by LOCC operations, Alice
and Bob may be able to refine their impure entanglement into a smaller
amount, fewer qubit pairs, with more pristine entanglement. This is
called entanglement distillation, and it's the entanglement version of
error correction.
In our version of the game, Alice and Bob will be interviewed separately.
We will ask only binary (yes/no) questions. There are only two possible
questions, X and Y, and each contestant will be asked only one of them.
The questions asked might be the same or different.
p(agree)          Bob asked X    Bob asked Y
Alice asked X         100%            0%
Alice asked Y           0%            0%
The interviewers in the game, who are information theory experts, devise
a mathematical test to see whether Alice and Bob are exchanging secret
signals. Here's the idea: If Alice is sending messages to Bob about what
happens in her interview, it might be possible to trick her into sending a
message for the interviewers.
Let QA stand for the question that Alice is asked, either X or Y. And
let RB stand for the answer that Bob gives, either yes or no. The
question is whether QA has any influence on RB. That is, does Bob's
answer give any indication of what Alice is asked? If so, then the
interviewers could send their own messages using Alice and Bob.
For Alice and Bob, the exact amount of information about QA that
appears in RB is 0. Whichever question Alice is asked, Bob's answer will
still be yes or no with equal probability. True, Alice's answer and Bob's
answer are correlated, but the interviewers do not control the answers, so they cannot use that correlation to send a message of their own.
Suppose Bob and Bill give different answers to their questions. Now
what if Alice is asked question Y? Because she is asked Y and Bob
is asked X, their answers will disagree. But that would mean that Bill
would agree with her, again a violation of the table. If they are both
asked Y, they have to disagree.
Thus, either way Bill answers his own question, the same as Bob or
different, there is a possible question for Alice that would disprove
Bill's claim to have the same special relationship with her that Bob
does. Bill must be an imposter!
p(agree)          Bob asked X    Bob asked Y
Alice asked X          85%           15%
Alice asked Y          15%           15%
p(agree)           Photon B, XB    Photon B, YB
Photon A, XA            85%             15%
Photon A, YA            15%             15%
The first method for quantum key distribution was devised in 1984
by Charles Bennett of IBM and Gilles Brassard of the University of
Montreal. Their scheme is known as BB84, and its discovery marked the
birth of quantum cryptography. However, BB84 is somewhat elaborate,
and it's somewhat difficult at first to see why it does the job. A few
years later, Artur Ekert of the University of Cambridge realized that the
complications of BB84 obscured the real point of quantum cryptography:
In effect, Ekert showed that entanglement is the key. Let's see how this
might work.
On about 25% of the photon pairs, they chose the same analyzer axes.
On these entangled photons, that means that their results must disagree.
It's easy for Bob to negate his results, so that they always agree. Now
Alice and Bob each have a collection of random, shared, and completely
secret bits; in short, they have a perfect new key.
Even if Alice and Bob are suspicious of the source of the entangled
photon pairs, they can still achieve complete confidence in the secrecy
of their key.
Remember, 75% of the time, they chose different axes for their two
photons. These are not as good for building a secret key because
they are not perfectly correlated.
Instead, Alice and Bob share these results with each other over their
open channel. By comparing notes, they can test the probabilities
in the Newlywed Game experiment. They can test that Q1 and Q2
results agree only 15% of the time, that Q1 and Q4 results agree 85%
of the time, and so on.
From this data alone, they can establish beyond any doubt that
the information relationship between their photons is completely
monogamous. No matter what tampering Eve may have
attempted, Alice and Bob know that she can possess no particle or
measurement record that shares their key information.
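For a sense of where numbers like 85% and 15% come from, here is a small sketch (my own illustration, not drawn from the lecture). For one standard polarization-entangled state, the probability that the two results agree is the squared cosine of the angle between the analyzers; the analyzer angles below are assumptions chosen to reproduce the pattern in the table above.

import numpy as np

def p_agree(angle_a_deg, angle_b_deg):
    delta = np.radians(angle_a_deg - angle_b_deg)
    return np.cos(delta) ** 2

alice = {"XA": 0.0, "YA": -45.0}      # assumed analyzer angles, in degrees
bob = {"XB": 22.5, "YB": 67.5}

for a_name, a_angle in alice.items():
    for b_name, b_angle in bob.items():
        print(a_name, b_name, f"{p_agree(a_angle, b_angle):.0%}")
# Prints approximately: XA XB 85%, XA YB 15%, YA XB 15%, YA YB 15%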
Quantum key distribution has also gone commercial. There are several
companies that now build and sell quantum cryptographic systems.
Some electronic banking transactions in Europe have already been
encrypted using key bits shared via quantum key distribution.
Someday, perhaps not that far off, quantum computers could make
factoring a far easier problem. If that happens, the public key systems
we use today will become insecure. Worse, all of today's encrypted
message traffic, recorded and stored somewhere, will become
readable by any adversary with a quantum computer. For secrets
that need to be kept over long time scales, one-time pad encryption
based on quantum key distribution provides the only real guarantee of
cryptographic security.
READINGS
Aczel, Entanglement.
QUESTIONS
It from Bit: Physics from Information 23
John Wheeler was an American physicist famous for his bold
ideas and insights about the deepest problems in physics. To
ideas and insights about the deepest problems in physics. To
a physicist, Wheeler said, the laws of physics are mathematical
equations, but even the greatest mathematical theory of the
cosmos is missing the principle by which it can become real. Wheeler did
not claim to know what that could be, but he thought there were clues
scattered throughout the fundamental theories of physics. One important
clue, he believed, came from quantum theory and the central role that
information plays in it. Always ready with a memorable phrase, Wheeler
called his idea "it from bit."
Thus, once we have made a black hole out of our data storage
region, the information is lost, and that's what limits us.
Let's look at the life cycle of a black hole. It starts out as matter,
the collapsing remnant of a massive star. The hole forms, and all the
information encoded in that matter disappears behind the event horizon.
After things settle down, the black hole emits radiation for a long time. Very
slowly, it loses mass and shrinks, gradually evaporating away. Eventually
(the exact details of the final stage are not well understood), the black hole
is gone. Here's the question: Where did the original information go?
It crossed the event horizon into the black hole, but eventually the black
hole goes away. There is some very faint quantum radiation discovered
by Hawking, but that seems to have nothing to do with the material
that went into the black hole. This is called the black hole information
problem, and it is an especially knotty puzzle in theoretical physics.
We cannot say yet where the holographic principle will lead. Many
physicists today regard it as the key to a deeper kind of physics.
Wheeler always regarded black holes as another clue to something
missing in physics. If the elementary quantum phenomenon is the birth
of information, the black hole, perhaps, is its death.
If we divide the area of that surface by the tiny area for 1 bit, we get
a huge number: about 10^120 bits.
In 2001, Seth Lloyd from MIT tried to figure out what such a number means.
He said that we can regard the Universe as a gigantic information-processing machine, a kind of cosmic computer.
The cosmic horizon number, 10^120 bits, is possibly too large because it
should give the maximum amount of information that can be contained
within a volume the size of our Universe. But the Universe is mostly
empty. Its information content is less.
Lloyd said that we can estimate the number of bits in the cosmic
computation by estimating the total entropy of the contents of the
Universe. The lion's share of that entropy is actually not in stars or gas
but in the cosmic microwave background, the faint radiation left over
from the hot, dense era soon after the big bang. The entropy of that
radiation is about 10^90 bits. That's much smaller, but it is still gigantic.
Armed with this idea, Lloyd estimated that the total number of basic
computer operations on the 10^90 bits in the Universe has been
about 10^120 ops, the same as the bit number for the cosmic horizon.
In fact, Lloyd showed that this is no coincidence but would hold true
in any expanding universe like ours. The horizon bit number equals
the cosmic op number.
That being true, what does it actually mean to say that pi has an
infinite number of decimal places? Almost all of those digits can
have no meaning and no existence in the physical world.
The quest for an information basis for physics has recently gone in an
interesting new direction. The idea is to try to identify where quantum
mechanics itself comes from. Perhaps we can answer this question by
figuring out the minimum logical basis for quantum mechanics: What is
the smallest set of postulates, of the simplest kind, that would lead to
the whole quantum mathematical machinery? Can we derive quantum
mechanics from axioms about information?
TERMS
READINGS
QUESTIONS
One of the most astounding features of the science of
information is its ubiquity. Claude Shannon's revolution has
spread far beyond the technology of telecommunication. The
spread far beyond the technology of telecommunication. The
concepts and principles he formulated have become essential
elements of sciences from biology to quantum physics. Why? What is it
about information that makes it such an indispensable idea?
Isomorphism
Actually, sending information even 200 years into the future (in a
time capsule, perhaps) poses some challenges. The question is not
basic understanding; it is easy to read books from 200 years ago.
The problem is that most of the information we have and use today is
electronic in form, but most of the technologies we use to store that
information won't last for 200 years. For our time capsule, the best bet
is to use old-fashioned data storage technologies, such as print on acid-
free paper or inscriptions on metal or stone.
The scientists who studied the problem came to realize that the solution
involved a multilayered message, each layer building on the content
of the last. The first layer is straightforward: "People made this." The
second layer is also simple: "Stay away from here." The third layer is more detailed, explaining what is buried at the site and why it is dangerous.
The outer layers of the message will have to be largely pictorial. One
cartoon might show someone digging, then growing very ill and dying.
A star chart could convey the notion of 10,000 years by using the
slow precession of the Earth's rotation axis. Diagrams could convey
mathematical concepts, such as numbers, which could in turn be used
on blueprints and technical data. Small samples of the entombed waste
would likely be buried at much shallower depths to permit scientific study.
The danger to guard against is the assumption that the recipients share
parts of our many visual or linguistic codes. For example, we might
consider the meaning of an arrow pointing in a certain direction to be
obvious, but they might not.
Nevertheless, we can reasonably expect that we will share some
things with the future recipients of our warning message. Physical
features of the environment will stay relatively constant; the
constellations will shift but not noticeably change shape; and so on.
The notion that there could be thinking beings on other worlds has been
around since antiquity, but modern research into this idea began in 1959
with a paper entitled "Searching for Interstellar Communications" by
Philip Morrison and the Italian physicist Giuseppe Cocconi.
The Arecibo message was transmitted on November 16, 1974, toward the
dense globular cluster M13, about 22,000 light-years away. Given that
the signal was broadcast only once, it is unlikely that anyone in M13 will
happen to hear it. Nevertheless, the broadcast power was enormous:
1 million watts, twice as great as the largest commercial transmitter in
history and directed with an antenna gain factor of tens of millions. Any receiver in that narrow beam with a comparable radio telescope could, in principle, detect it.
The most interesting aspect of the Arecibo message is not the remote
chance that it will be received but the fascinating strategy used to
solve the anti-cryptography problem. The message relies on things
we must share with the intended recipients, alien though they may be:
mathematics and logic; the facts of physics, chemistry and astronomy;
and the physical characteristics of the signal itself. Anyone in the
Universe who can build a radio telescope must know these things. They
are an information basis on which we can build a shared code.
And not everyone thinks that radio signals are the most likely means
for interstellar communication. Light signals offer many advantages,
and in fact, there are several telescope-based projects underway to
search for extraterrestrial light signals.
READINGS
QUESTIONS
Lecture 1
Lecture 2
Note: To calculate log2(x), find ln(x) and divide by ln(2) = 0.6931. Thus,

$$\log_2(3) = \frac{\ln(3)}{\ln(2)} = \frac{1.0986}{0.6931} = 1.585.$$
2 Each letter has entropy of log2(26) = 4.70 bits, and each digit has
log2(10) = 3.32 bits. The total amount of car-identification entropy is,
thus: (3 × 4.70) + (4 × 3.32) = 27.38 bits. This is enough to identify about
175 million different cars; as it happens, Ohio has only about 12 million
vehicles registered statewide.
4 First, ask: Is the word in the first half of the dictionary? Next, ask which of
the two possible quarters of the dictionary the word is in, and so on. At
any stage, you have a range of possible pages in the dictionary; the key
idea is to divide that page range in half and find out which half is correct.
(Eventually, you narrow things down to a single page. At that point, you
ask whether the word is on the top half of the page, and so on.)
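The halving strategy is easy to express as a short sketch (an illustration only; the page and word counts are placeholders):

def questions_needed(num_items):
    # Each yes/no question cuts the remaining range roughly in half.
    count, span = 0, num_items
    while span > 1:
        span = (span + 1) // 2
        count += 1
    return count

print(questions_needed(1000))       # a 1,000-page dictionary: 10 questions
print(questions_needed(100_000))    # 100,000 words: 17 questions, about log2(100,000)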
1 The answer depends on the table you consult, but for one online table,
the answer was 11: the letters Y, W, G, P, B, V, K, X, Q, J, and Z combined
have about the same frequency as E. (Because the table I consulted did
not include spaces, this was about 12%.) The total probability for the 12
most common letters was about 80%.
[Joint probability table p(S, T), with columns for warm and cool temperatures and rows for the sky conditions]
Warm and cloudy, the least likely type of weather, is the most surprising.
The entropies are given by the average surprises: H(S) = 0.97 bits,
H(T) = 1.00 bit, and H(S,T) = 1.85 bits (the weighted average of the
surprises in this table).
$$p = \frac{50}{300{,}000{,}000 \times 365} \approx 9 \times 10^{-12},$$

less than one chance in 100 billion. The surprise s = log2(1/p) = 36.7 bits.
When we flip a coin many times, the surprise increases by 1.0 bit for
each heads; thus, getting 40 heads in a row has a surprise of 40 bits,
even more surprising than getting struck by lightning.
E 100 (dot-space)
T 1100 (dash-space)
A 101100 (dot-dash-space)
O 1101101100 (dash-dash-dash-space)
I 10100 (dot-dot-space)
N 110100 (dash-dot-space)
Notice that every letter must end with a 0 to give an extra-long pause
between letters. Otherwise, A (dot-dash) would be the same as E
followed by T.
Lecture 6
Lecture 7
2 For this error rate, the binary entropy function h(0.001) = 0.0114 bits. This
is the information loss per use of the channel. In a 1000-bit message,
we expect to lose about 11.4 bits. (This makes sense. Even with a single
error, it would take about log2(1000) = 9.97 bits of correction data to
locate it, and there may sometimes be more than one error.)
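These figures come straight from the binary entropy function, as this small check shows:

import math

def h(p):
    # Binary entropy: information lost per bit sent through a channel with error probability p.
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(h(0.001))            # about 0.0114 bits per channel use
print(1000 * h(0.001))     # about 11.4 bits lost in a 1000-bit message
print(math.log2(1000))     # about 9.97 bits needed to locate one error among 1000 positions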
Lecture 8
$$\frac{2^n}{n+1} = \frac{2^7}{8} = 16,$$
Lecture 9
Lecture 11
1 Both are letter-exchange ciphers. That is, the cipher system simply
swaps pairs of letters. This means that the encoding and decoding
processes for a message are exactly the same.
2 A random 20-letter key phrase has H(K) = 20 × 4.7 bits = 94 bits. With
D = 3.2 bits/letter for English, this yields: L = H(K)/D ≈ 29 letters. This
seems unreasonably short, less than 50% longer than the code phrase
itself! This is certainly not enough for the Babbage method.
Lecture 13
1 Given that there are about 6 × 10^9 base pairs in human DNA, the total
length is (6 × 10^9) × (0.33 × 10^−9 m) = 2 m. My DNA is slightly longer than
I am.
2 One base pair means just two possibilities: AT and TA. With four bases
per codon, this gives 2^4 = 16 possible codons. If the genetic code is like
the one on Earth, however, at least one of these must be a stop codon,
leading to a maximum of 15 amino acids. However, this would mean that
each amino acid had only one possible codon, which means that every
mutation would change the corresponding protein. The code would be
much less fault-tolerant than ours.
Lecture 14
3 In many ways, prion diseases are more analogous to the inorganic tin
disease. The original misshapen protein can arise by accident, without
any parent.
Lecture 15
Lecture 16
$$\Delta S = \frac{Q}{T} = \frac{2000}{373} = 5.4\ \text{joules/kelvin}.$$
Lecture 17
1 One idea would simply be to slow the computer down. In our simple
capacitor example, diminishing the speed by a factor of 10 reduced the
power output by 100 times and the total dissipation of a computation by
10 times.
3 If the gate has n input bits, then it has 2^n possible input states. It needs
to have at least as many output states, because each different input
leads to a different output. Therefore, it needs at least n output bits. But
a reversible gate can be run backward, exchanging input for output,
and that means that input and output bit numbers must be exactly equal.
Lecture 18
Lecture 19
Lecture 21
3 At this stage, very soon before 2019, it is probably wisest to take the
negative side of the bet, that is, against my position. On the other
hand, given the present rate of progress, 20 more years might make a
considerable difference!
Lecture 23
2 Your answers may vary. As for myself, I think that the search for good
information axioms can help us to understand the deeper structure of
quantum theory, even if we do not take any particular set of axioms
too seriously. (After all, someday, we may have a better theory than
quantum mechanics!)
Lecture 24
H = log2(M ),
where M is the number of possible messages from the source, each message
considered to be equally likely. The Shannon entropy generalizes this for a
source whose messages are not equally likely:
$$H = \sum_x p(x)\log_2\frac{1}{p(x)},$$
where log2(1/p(x)) is the surprise of message x.

The information inequality states that computing the average surprise with any other probability assignment q(x) can only give a larger result:

$$H \le \sum_x p(x)\log_2\frac{1}{q(x)}.$$
The intuitive way to understand this relation is to imagine two people describing
the same information source. Fred knows the correct probabilities p(x) for the
messages, while George mistakenly thinks the probabilities are q(x). Different
probabilities lead to different measures of the surprise of the messages.
Sometimes Fred is more surprised than George, and sometimes the reverse is
true, but on average, George's surprise is greater.
We used the information inequality to show that no prefix-free code can have an
average codeword length less than the Shannon entropy (Lecture 5) and to show
that the best way to hedge horse race bets is to make our betting fraction equal
to the probability of winning for each horse in the race (Lecture 18).
Consider a set of binary codewords for the messages x from a source. The
codeword for x has length L(x) bits. Then, the code can be prefix-free only if the
following is true:
$$\sum_x 2^{-L(x)} \le 1.$$
We can add two additional comments. First, if we fix a set of lengths satisfying this
inequality, we can always find a prefix-free code with codewords of these lengths.
Second, we can always trim our codewords (as in Huffman coding) so that the
inequality becomes an equation, that is, the left-hand side equals 1.
I(X; Y) = H(X) − H(X | Y) = H(X) + H(Y) − H(X, Y).
Here, H( X |Y ) is the entropy of the input given the output, and H( X,Y ) is the joint
entropy of the input-output pair.
Though many powerful error-correcting codes are now known, almost none
were discovered before Shannon proved his great theorem. Knowing a thing is
possible is a crucial step to figuring out how to do it!
The information capacity C of an analog signal (e.g., a radio signal) in bits per
second is:
$$C = W \log_2\!\left(1 + \frac{P}{N}\right),$$
where W is the bandwidth (in Hertz), and P/N is the ratio of the signal power (P) to
the noise power (N). Note that N usually depends on W. (A channel with a wider
bandwidth admits more noise.)
$$L = \frac{H(K)}{D},$$
where D is the redundancy of the plaintext per symbol. For English text, Shannon
estimated that D ≈ 3.2 bits/letter.
For instance, a simple substitution cipher has a key entropy H(K) = log2(26!) ≈ 88
bits, giving a unicity distance of about 28 letters. If we have a sample of ciphertext
much longer than this, and we know in advance that a simple substitution cipher
is used, then we should be able to determine the original plaintext message.
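A two-line check of that example (nothing assumed beyond the figures quoted above):

import math

key_entropy = math.log2(math.factorial(26))   # key entropy of a simple substitution cipher, ~88.4 bits
print(key_entropy / 3.2)                      # unicity distance with D = 3.2 bits/letter: about 28 letters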
In the earliest days of life on Earth, the simplest self-replicating molecules made
copies of themselves in a very noisy environment. This noise limited the size of
the molecular sequence that could preserve its information into the future.
$$L \cdot e \le \log(S) \approx 1,$$

where L is the length of the molecular sequence, e is the probability of a copying error per base, and S is the survival advantage of the correct sequence compared to its imperfect
copies. The estimate log(S) ≈ 1 is a heuristic value; we do not expect log(S) to be
much larger than 1 in the conditions of the earliest self-replicating molecules.
$$S = k_2 \log_2(M) = k_2 H,$$

where M is the number of possible microstates of the system.
Entropy had been discovered earlier by Clausius, and we can determine how it
changes from the macroscopic processes a system undergoes. If heat Q is added
to a system of absolute temperature T (in kelvins), then the entropy changes by
ΔS = Q/T.
Landauer studied the basic physics of computer operations, including those that
erase information. He found that there is an inescapable thermodynamic cost for
erasing data. If information is erased, the environment must increase its entropy
by at least ΔS = k₂ per bit. If the environment has absolute temperature T, then
this corresponds to a waste heat of at least k₂T per bit.
Suppose we are betting on a horse race with ideal equal odds, and we distribute
our bets among the possible results in the best possible way (that is, betting a
fraction p( x) of our wealth on each outcome x). Then a tip containing information I
about the race result will allow us to achieve a gain factor per race of G, given by:
log2(G) = I (in exponential form, G = 2^I).
With no tip, I = 0 and, thus, G = 1. We do not expect either to win or lose money
in the long run. However, if the tip tells us the result of a binary horse race, then
I = 1 bit and G = 2. We expect to double our money per race. Less reliable or
informative tips provide a smaller advantage.
This G is the best expected gain factor in the long run. But anyone who actually
bets by Kelly's log-optimal strategy should expect some dramatic ups and downs
in wealth along the way.
K(s) = L(s*),
where L(x) is the length of string x in bits. If s has a pattern, then K(s) ≪ L(s). If s is
random, then K(s) ≈ L(s).
Zurek suggested that the thermodynamic entropy of any system that includes
a computer should be the sum of two parts: the regular Boltzmann entropy,
k₂log2(M), and a new algorithmic term, k₂K(m), taking into account the particular
memory bits m. However, because of the Berry paradox, the algorithmic
information is an uncomputable function, like Turing's halting function and the
halting probability Ω.
$$\mathrm{Prob}(s) = \sum_{p\ \mathrm{in}\ P(s)} 2^{-L(p)} \approx 2^{-L(s^*)} = 2^{-K(s)}.$$
This provides an absolute sense in which some strings s are more likely than
others. Ray Solomonoff used this idea to give a new perspective on Ockhams
razor, the logical principle that simpler theories are more likely theories. More
likely in what way? In monkey probability!
where M is the total number of histories that could have led to the present black
hole, and thus, log2(M ) is the number of bits that have vanished into the hole.
A_horizon is the area of the black hole event horizon. As discovered by Hawking, the
constant above is such that 1 bit of missing information corresponds to an area of
about 7 × 10^−70 m².
A black hole the mass of our Sun has a radius of about 3 km, yielding a horizon
surface area of more than 100 million square meters. Such a black hole would
correspond to a truly vast missing information of more than 10^77 bits.
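A rough check of those numbers, using only the 1-bit area quoted above:

import math

radius = 3.0e3                        # meters, for a solar-mass black hole
area = 4 * math.pi * radius ** 2      # event horizon area, about 1.1e8 square meters
area_per_bit = 7e-70                  # square meters per bit of missing information
print(area)                           # more than 100 million square meters
print(area / area_per_bit)            # about 1.6e77 bits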
The question of whether this information is truly lost or whether it will later
reemerge via quantum radiation from the black hole is called the black hole
information problem.
The same connection between information and area motivates the holographic
principle, according to which we can suppose that the information content of any
region of space is really located on its outer boundary, with each bit occupying
no more than 7 × 10^−70 m².
Glossary
bandwidth: The range of frequencies used by an analog signal, be it sound,
radio, or an electrical voltage. Every communication channel has a limited
bandwidth, a finite range of frequencies that can be conveyed faithfully.
Along with the signal-to-noise ratio, the bandwidth limits the information
rate of an analog signal, according to the Shannon-Hartley formula.
Baudot code: Teletype code first introduced by Émile Baudot in 1870 and
later much revised for international use, with each letter represented by a
codeword of 5 bits and additional symbols (such as numerals) expressed
by the use of a special codeword that shifts to a second, alternative set of
codewords. For data storage, the 1s and 0s can be represented by small
holes punched in a paper tape.
black hole information problem: Does information that falls into a black hole
ultimately disappear from the Universe, or does it eventually reemerge in
the quantum radiation from the hole? The subject of a famous bet, in which
physicist John Preskill argued that the information eventually returns; Stephen
Hawking, who bet against him, later conceded, though Kip Thorne was reportedly not yet convinced.
Boolean logic: The algebraic logic devised by George Boole in the 19th
century, used as the basis for the design of fully electronic computers
in Claude Shannon's 1937 master's thesis. Shannon showed that any
mathematical calculation could be reduced to a complex network of Boolean
operations and that these could be implemented via electrical circuits.
Braille code: The first binary code, introduced as a system of writing for blind
readers by Louis Braille in 1837. Each letter is represented by a codeword of
6 bits, arranged as a rectangular pattern of raised dots on paper. Additional
symbols are available by the use of special shift symbols.
codon: A group of three successive bases in DNA, forming a codeword that
represents a single amino acid in a protein. With four base pairs (AT, TA,
CG, and GC), there are 64 possible codons, which is more than enough to
represent the 20 possible amino acids. (There are also start and stop
codons to designate the ends of the protein sequences.)
eidostate: A neologism (based on the Greek word meaning "to see") for the
state of a thermodynamic system, including all of the information available to
Maxwell's demon, whether it is macroscopic or microscopic.
definitions of entropy. The algorithmic entropy of Andrey Kolmogorov and
Gregory Chaitin is the length of the shortest computer program that can
produce a particular binary string as its output. In quantum mechanics, the
von Neumann entropy (formulated by John von Neumann) takes into account
that nearby quantum states are not fully distinguishable.
full adder: A combination of Boolean logic gates that computes the sum of
three 1-bit inputs and produces the answer as a 2-bit binary number. Because
the full adder can include the results of a carry operation, a cascade of
these devices can accomplish the addition of two binary numbers of any
specified length.
Huffman code: A prefix-free binary code that is the most efficient code for a
given set of message probabilities. Devised by MIT graduate student David
Huffman in 1952.
those used by radio telescopes, are often cooled to a low temperature to
reduce this internal source of interference.
JPEG: A flexible and highly efficient data-compression format for many kinds
of image data, introduced by the Joint Photographic Experts Group in the
1990s. There is relatively little noticeable loss of image quality except for
cartoons and diagrams involving fine lines and lettering, which tend to be
surrounded by unwanted artifacts in the coded picture.
fact that successive frames of a video usually have only small differences.
One form of MPEG-4, also known as the H.264 format, is the usual standard
for streaming video and Blu-Ray disks.
neuron: The basic unit of the brain for information processing and
communication. Each neuron consists of a main body called the soma and
a long fiber called an axon. Connections from other neurons induce the
soma to produce nerve impulses, which are transmitted down the axon and
influence other neurons connected at its end.
quantum entanglement: A monogamous information relationship between
quantum particles, in which neither particle by itself is in a definite quantum
state, but the two together do have a definite quantum state.
redundancy: The property of language that uses many more bits (measured
by the number of letters) to represent information than strictly necessary.
This gives natural languages the property of error correction, but it also
provides the basic vulnerability of a cipher system to cryptanalysis.
signal-to-noise ratio: Also known as SNR, the ratio of the power of a signal
to the power of the interfering noise. This is often described logarithmically,
using decibels. Along with the signal bandwidth, the signal-to-noise ratio
limits the information rate of an analog signal, according to the
Shannon-Hartley formula.
XOR: The exclusive OR operation in Boolean logic. When both input bits
agree (00 or 11), then the output is 0; when they disagree (01 or 10), the
output is 1.
Zipf's law: An empirical fact about language first noted by the linguist
George Kingsley Zipf in 1935. Simply stated, in any long sample of English
text, the Nth most common word is about N times less likely than the most
common word. Similar regularities occur in other languages.
Bibliography
PRINT RESOURCES
Avery, John Scales. Information Theory and Evolution. 2nd ed. World
Scientific, 2012. Written by an eminent theoretical chemist, this book is
an ambitious attempt to bring together ideas of information science to
understand the essential issues of biological and cultural evolution, as
well as the implications of new technologies. Avery does a good job at the
difficult task of combining technical sophistication with accessibility.
Davies, Paul. The Fifth Miracle. Penguin, 1998. This excellent book about the
origin of life comes from an eminent physicist who has brought his insight
and training to bear on a remarkably wide range of scientific questions, from
cosmology to cancer. He is also an award-winning writer and broadcaster on
scientific subjects.
Doody, Dave. Basics of Space Flight. Bluroof Press, 2011. Doody, a NASA
engineer specializing in spacecraft operations, originally wrote this as an
online primer on all aspects of space missions; later, it was published as an
excellent book. Doody's discussions of telecommunications and the Deep
Space Network will be particularly rewarding to students of this course.
Hamming, Richard. Coding and Information Theory. Prentice Hall, 1980. This
is my favorite of the many mathematical textbooks on information theory,
and it is written by one of the discoverers of error-correcting codes. Though
advances in the coding field have rendered it slightly out of date, it remains
valuable for its clear approach and clean mathematical arguments.
Hinsley, F. H., and Alan Stripp, eds. Codebreakers: The Inside Story of
Bletchley Park. Oxford, 1993. This is a fascinating account of the British
effort to break the German Enigma code during World War II, written by the
participants. It covers both the technical side and the human side: What
was it like to work endless hours on a super-secret project that affected the
whole outcome of the war?
Leff, Harvey S., and Andrew F. Rex, eds. Maxwell's Demon 2. Institute of
Physics, 2003. Following the publication of a previous (and also excellent)
historical collection of papers about Maxwell's demon in 1990, Professors
Leff and Rex assembled this updated and expanded collection.
Renyi, Alfred. A Diary on Information Theory. Wiley, 1987. Hungarian
mathematician Alfred Renyi made profound contributions to probability and
information theory. In this quirky and charming book, he pens the imaginary
diary of a mathematics student coming to grips with Shannon's theory for
the first time. (Later chapters touch on a wide variety of mathematical ideas.)
Seife, Charles. Decoding the Universe. Viking, 2006. This is an excellent book
on Shannon's information theory and the adoption of its ideas into biology
and physics. Seife, an accomplished science journalist, knows his territory
well and chooses his topics and examples to great effect. Even an expert can
learn something from his lucid explanations and appealing illustrations.
Siegfried, Tom. The Bit and the Pendulum. Wiley, 2000. Siegfried is a truly
excellent science journalist and served for many years as the editor of
Science News. His book is a highly readable account of the invasion of
information ideas into physics and biology in the 1980s and 1990s.
Wheeler, John Archibald, and Kenneth W. Ford. Geons, Black Holes and
Quantum Foam: A Life in Physics. Norton, 2000. John Wheeler's physics
memoir, written with Ken Ford, is a wonderful introduction to his amazing
career and unique style of thinking. Students of this course will be
particularly interested in the later chapters, in which Wheeler discusses his
quest for the information basis of physics: it from bit.
ONLINE RESOURCES
The Logic Lab. www.neuroproductions.be/logic-lab/. A fun and enlightening
online program for experimenting with logic gates and flip-flop circuits.
IMAGE CREDITS

Page 46: Studio-Annika/iStock/Thinkstock.
Page 106: Stocktrek Images/Thinkstock.
Page 106: NASA.
Page 110: MarioGuti/iStock/Thinkstock.
Page 120: Central Intelligence Agency.
Page 130: Daderot/Wikimedia Commons/Public Domain.
Page 141: Archive.org.
Page 162: United States Geological Survey.
Page 194: Depositphotos.com/Nicku.
Page 296: vintagedept/flickr/CC BY 2.0.
Page 300: Isaac Ruiz Santana/iStock/Thinkstock.