Poems That Solve Puzzles: The History and Science of Algorithms
POEMS THAT
SOLVE PUZZLES
The History and Science of Algorithms
Chris Bleakley
OUP CORRECTED PROOF – FINAL, 15/7/2020, SPi
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Chris Bleakley 2020
The moral rights of the author have been asserted
First Edition published in 2020
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2020933199
ISBN 978–0–19–885373–2
Printed and bound by
CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
algorithm, noun:
A process or set of rules to be followed in calculations or other
problem-solving operations, especially by a computer.
The Arabic source, al-Kwārizmī ‘the man of Kwārizm’ (now
Khiva), was a name given to the ninth-century mathemati-
cian Abū Ja’far Muhammad ibn Mūsa, author of widely trans-
lated works on algebra and arithmetic.
Oxford Dictionary of English, 2010
Foreword
This book is for people who know algorithms are important, but have
no idea what they are.
The inspiration for the book came to me while working as Director
of Outreach for UCD’s School of Computer Science. Over the course
of hundreds of discussions with parents and secondary school students,
I realized that most people are aware of algorithms, thanks to extensive
media coverage of Google, Facebook, and Cambridge Analytica. How-
ever, few know what algorithms are, how they work, or where they
came from. This book answers those questions.
The book is written for the general reader. No previous knowledge
of algorithms or computers is needed. However, even those with a
degree in computing will, I think, find the stories herein surprising,
entertaining, and enlightening. Readers with a firm grasp of what an
algorithm is might like to skip the introduction. My aim is that readers
enjoy the book and learn something new along the way.
My apologies to the great many people who were involved in the
events described herein, but who are not mentioned by name. Almost
every innovation is the product of a team working together, building
on the discoveries of their predecessors. To make the book readable
as a story, I tend to focus on a small number of key individuals. For
more detail, I refer the interested reader to the papers cited in the
bibliography.
In places, I favour a good story over mind-numbing completeness.
If your favourite algorithm is missing, let me know and I might slip it
into a future edition. When describing what an algorithm does, I use the
present tense, even for old algorithms. I use plural pronouns in place of
gender-specific singular pronouns. All dollar amounts are US dollars.
Many thanks to those who generously gave permission to use their
photographs and quotations. Many thanks also to my assistants in this
endeavour: my first editor, Eoin Bleakley; my mentor, Michael Sheri-
dan (author); my wonderful agent, Isabel Atherton; my bibliography
wrangler, Conor Bleakley; my ever-patient assistant editor, Katherine
Ward; everyone at Oxford University Press; my reviewers, Guénolé
Silvestre and Pádraig Cunningham; and, last, but certainly not least, my
parents and my wife. Without their help, this book would not have been
possible.
Read on and enjoy!
Chris
Contents
Introduction 1
1 Ancient Algorithms 9
2 Ever-Expanding Circles 25
3 Computer Dreams 39
4 Weather Forecasts 55
5 Artificial Intelligence Emerges 75
6 Needles in Haystacks 93
7 The Internet 117
8 Googling the Web 143
9 Facebook and Friends 159
10 America’s Favourite Quiz Show 171
11 Mimicking the Brain 179
12 Superhuman Intelligence 203
13 Next Steps 215
Appendix 229
Notes 233
Permissions 241
Bibliography 243
Index 259
Introduction
‘One for you. One for me. One for you. One for me.’ You are in the
school yard. The sun is shining. You are sharing a packet of sweets with
your best friend. ‘One for you. One for me.’ What you didn’t realize back
then was that sharing your sweets in this way was an enactment of an
algorithm.
An algorithm is a series of steps that can be performed to solve an
information problem. On that sunny day, you used an algorithm to
share your sweets fairly. The input to the algorithm was the number of
sweets in the packet. The output was the number of sweets that you and
your friend each received. If the total number of sweets in the packet
happened to be even, then both of you received the same number of
sweets. If the total was odd, your friend ended up with one sweet more
than you.
An algorithm is like a recipe. It is a list of simple steps that, if followed,
transforms a set of inputs into a desired output. The difference is that
an algorithm processes information, whereas a recipe prepares food.
Typically, an algorithm operates on physical quantities that represent
information.
Often, there are alternative algorithms for solving a given problem.
You could have shared your sweets by counting them, dividing the
total by two in your head, and handing over the correct number of
sweets. The outcome would have been the same, but the algorithm—
the means of obtaining the output—would have been different.
An algorithm is written down as a list of instructions. Mostly, these
instructions are carried out in sequence, one after another. Occasion-
ally, the next instruction to be performed is not the next sequential
step but an instruction elsewhere in the list. For example, a step may
require the person performing the algorithm to go back to an earlier
step and carry on from there. Skipping backwards like this allows
repetition of groups of steps—a powerful feature in many algorithms.
The steps ‘One for you. One for me.’ were repeated in the sweet-sharing
algorithm. The act of repeating steps is known as iteration.
If the number of sweets in the packet was even, the following iterative
algorithm would have sufficed:

Repeat the following steps:
Take a sweet from the packet and give it to your friend.
Take a sweet from the packet and keep it for yourself.
Stop repeating when the packet is empty.

Like all good algorithms, this one is neat and achieves its objective in an
efficient manner.
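The chant translates directly into a short program. The sketch below is my own illustration (the book contains no program listings); the function name is mine, and the first sweet goes to the friend, matching the order of the chant.

```python
def share(packet_size):
    """Deal out a packet of sweets: one for you, one for me."""
    yours, mine = 0, 0              # 'you' is the friend, 'me' the dealer
    while packet_size > 0:          # repeat until the packet is empty
        yours += 1                  # 'One for you.'
        packet_size -= 1
        if packet_size > 0:
            mine += 1               # 'One for me.'
            packet_size -= 1
    return yours, mine
```

As the text notes, an even packet is split equally, while an odd packet leaves the friend one sweet ahead.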
the biggest stack contains just five books. The piles are then sorted
separately using Insertion Sort. Finally, the sorted piles are transferred,
in order, to the shelf.
For maximum speed, the pivot letters should split the piles into two
halves.
Let’s say that the original pile contains books from A to Z. A good
choice for the first pivot would likely be M. This would give two new
piles: A–L and M–Z (Figure I.2). If the A–L pile is larger, it will be split
next. A good pivot for A–L might be F. After this split, there will be
three piles: A–E, F–L, and M–Z. Next, M–Z will be split and so on. For
twenty books, the final piles might be: A–C, D–E, F–L, M–R, and S–Z.
These piles are ordered separately using Insertion Sort and the books
transferred pile-after-pile onto the shelf.
The complete Quicksort algorithm can be written down as follows:
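The pile-splitting procedure described above can be sketched in code. This sketch is mine, not the book's listing: the function names are illustrative, books are represented by their shelving keys, and piles of five or fewer items are handed to Insertion Sort, as in the text.

```python
def insertion_sort(pile):
    """Sort a small pile by inserting each item into its place."""
    for i in range(1, len(pile)):
        item = pile[i]
        j = i - 1
        while j >= 0 and pile[j] > item:  # shift larger items right
            pile[j + 1] = pile[j]
            j -= 1
        pile[j + 1] = item
    return pile

def quicksort(pile):
    """Split piles around a pivot until small, then insertion-sort each."""
    if len(pile) <= 5:                    # small pile: use Insertion Sort
        return insertion_sort(list(pile))
    pivot = pile[len(pile) // 2]          # a pivot from mid-pile
    lower = [b for b in pile if b < pivot]
    equal = [b for b in pile if b == pivot]
    higher = [b for b in pile if b > pivot]
    # Sort the two sub-piles separately, then transfer them, in order.
    return quicksort(lower) + equal + quicksort(higher)
```

Separating out the items equal to the pivot guarantees that each recursive call works on a strictly smaller pile, so the splitting always terminates.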
1
Ancient Algorithms
The desert has all but reclaimed Uruk. Its great buildings are almost
entirely buried beneath accretions of sand, their timbers disintegrated.
Here and there, clay brickwork is exposed, stripped bare by the wind or
archaeologists. The abandoned ruins seem irrelevant, forgotten, futile.
There is no indication that seven thousand years ago, this land was the
most important place on Earth. Uruk, in the land of Sumer, was one of
the first cities. It was here, in Sumer, that civilization was born.
Sumer lies in southern Mesopotamia (Figure 1.1). The region is
bounded by the Tigris and Euphrates rivers, which flow from the
mountains of Turkey in the north to the Persian Gulf in the south.
Today, the region straddles the Iran–Iraq border. The climate is hot and
dry, and the land inhospitable, save for the regular flooding of the river
plains. Aided by irrigation, early agriculture blossomed in the ‘land
between the rivers’. The resulting surplus of food allowed civilization
to take hold and flourish.
The kings of Sumer built great cities—Eridu, Uruk, Kish, and Ur.
At its apex, Uruk was home to sixty thousand people. All of life was
there—family and friends, trade and religion, politics and war. We know
this because writing was invented in Sumer around 5,000 years ago.
Figure 1.1 Map of ancient Mesopotamia, showing the Tigris and Euphrates
rivers, Babylonia (Babylon), Sumer (Uruk, Ur, and Eridu), and the later port
of Alexandria.
Etched in Clay
It seems that writing developed from simple marks impressed on wet
clay tokens. Originally, these tokens were used for record keeping and
exchange. A token might equate to a quantity of grain or a headcount
of livestock. In time, the Sumerians began to inscribe more complex
patterns on larger pieces of clay. Over the course of centuries, simple
pictograms evolved into a fully formed writing system. That system is
now referred to as cuneiform script. The name derives from the script’s
distinctive ‘wedge shaped’ markings, formed by impressing a reed stylus
into wet clay. Symbols consisted of geometric arrangements of wedges.
These inscriptions were preserved by drying the wet tablets in the sun.
Viewed today, the tablets are aesthetically pleasing—the wedges thin
and elegant, the symbols regular, the text neatly organized into rows
and columns.
The invention of writing must have transformed these communities.
The tablets allowed communication over space and time. Letters could
be sent. Deals could be recorded for future reference. Writing facilitated
the smooth operation and expansion of civil society.
Uncovered at Last
European archaeologists began to investigate the ruins of Mesopotamia
in the nineteenth century. Their excavations probed the ancient sites.
The artefacts they unearthed were shipped back to Europe for inspec-
tion. Amongst their haul lay collections of the inscribed clay tablets.
The tablets bore writing of some sort, but the symbols were now
incomprehensible.
Assyriologists took to the daunting task of deciphering the un-
known inscriptions. Certain oft repeated symbols could be identified
poems and magical curses. Amid the flotsam and jetsam of daily life,
scholars stumbled upon the algorithms of ancient Mesopotamia.
Many of the extant Mesopotamian algorithms were jotted down by
students learning mathematics. The following example dates from the
Hammurabi dynasty (1,800 to 1,600 bce), a time now known as the Old
Babylonian period. Dates are approximate; they are inferred from the
linguistic style of the text and the symbols employed. This algorithm
was pieced together from fragments held in the British and Berlin State
Museums. Parts of the original are still missing.
The tablet presents an algorithm for calculating the length and
width of an underground water cistern. The presentation is formal and
consistent with other Old Babylonian algorithms. The first three lines
are a concise description of the problem to be solved. The remainder
of the text is an exposition of the algorithm. A worked example is
interwoven with the algorithmic steps to aid comprehension. 5
A cistern.
The height is 3.33, and a volume of 27.78 has been excavated.
The length exceeds the width by 0.83.
You should take the reciprocal of the height, 3.33, obtaining 0.3.
Multiply this by the volume, 27.78, obtaining 8.33.
Take half of 0.83 and square it, obtaining 0.17.
Add 8.33 and you get 8.51.
The square root is 2.92.
Make two copies of this, adding to the one 0.42 and subtracting
from the other.
You find that 3.33 is the length and 2.5 is the width.
This is the procedure.
Simply taking the square root of this area would give the length and
width of a square base. An adjustment must be made to create the
desired rectangular base. Since a square has minimum area for a given
perimeter, the desired rectangle must have a slightly larger area than
the square base. The additional area is calculated as the area of a square
with sides equal to half the difference between the desired length and
width. The algorithm adds this additional area to the area of the square
base. The width of a square with this combined area is calculated. The
desired rectangle is formed by stretching this larger square. The lengths
of two opposite sides are increased by half of the desired length–width
difference. The length of the other two sides is decreased by the same
amount. This produces a rectangle with the correct dimensions.
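The tablet's steps can be checked by translating them into modern code. The rendering below is mine; the rounded sexagesimal values 3.33, 27.78, and 0.83 in the text are replaced by 10/3, 250/9, and 5/6, which I take to be the exact quantities they abbreviate.

```python
import math

def cistern(height, volume, diff):
    """Follow the tablet's steps: length and width of a cistern."""
    base_area = volume / height          # reciprocal of height, times volume
    half_diff = diff / 2                 # half the length-width difference
    side = math.sqrt(base_area + half_diff ** 2)  # side of larger square
    return side + half_diff, side - half_diff     # stretch and shrink

length, width = cistern(10 / 3, 250 / 9, 5 / 6)  # about 3.333 and 2.5
```

The result matches the tablet's answer: a length of 3.33 and a width of 2.5.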
Decimal numbers are used in the description above. In the original, the
Babylonians utilized sexagesimal numbers. A sexagesimal number system
possesses sixty unique digits (0–59). In contrast, decimal uses just ten
digits (0–9). In both systems, the weight of a digit is determined by
its position relative to the fractional (or decimal) point. In decimal,
moving right-to-left, each digit is worth ten times the preceding digit.
Thus, we have the units, the tens, the hundreds, the thousands, and
so on. For example, the decimal number 421 is equal to four hundreds
plus two tens plus one unit. In sexagesimal, moving right-to-left from
the fractional point, each digit is worth sixty times the preceding one.
Conversely, moving left-to-right, each column is worth a sixtieth of the
previous one. Thus, sexagesimal 1,3.20 means one sixty plus three units
plus twenty sixtieths, equal to 63 20/60, or 63.333 in decimal. Seemingly, the
sole advantage of the Old Babylonian system is that thirds are much
easier to represent than in decimal.
To the modern reader, the Babylonian number system seems bizarre.
However, we use it every day for measuring time. There are sixty
seconds in a minute and sixty minutes in an hour. The time 3:04 am
is 184 (3 × 60 + 4 × 1) minutes after midnight.
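Positional conversion of this kind is easy to mechanize. The routine below is my own sketch: it takes the sexagesimal digits to the left and right of the fractional point and returns the value in decimal.

```python
def sexagesimal(whole, frac=()):
    """Convert sexagesimal digits to a decimal value.

    whole: digits left of the fractional point, most significant first.
    frac:  digits right of the fractional point.
    """
    value = 0
    for digit in whole:       # each place left is worth sixty times more
        value = value * 60 + digit
    weight = 1 / 60
    for digit in frac:        # each place right is worth a sixtieth
        value += digit * weight
        weight /= 60
    return value

x = sexagesimal([1, 3], [20])  # one sixty + three units + twenty sixtieths
```

Fed the digits of 1,3.20, the function returns 63.333, and the time example works the same way: 3 and 4 in adjacent places give 184 minutes.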
Babylonian mathematics contains three other oddities. First, the frac-
tional point wasn’t written down. Babylonian scholars had to infer its
position based on context. This must have been problematic—consider
a price tag with no distinction between dollars and cents! Second, the
Babylonians did not have a symbol for zero. Today, we highlight the
gap left for zero by drawing a ring around it (0). Third, division was
performed by multiplying by the reciprocal of the divisor. In other
words, the Babylonians didn’t divide by two, they multiplied by a half.
Figure 1.2 Yale Babylonian Collection tablet 7289. (YBC 7289, Courtesy of the Yale
Babylonian Collection.)
Let’s say that the algorithm begins with the extremely poor guess of:
2.
Dividing 2 by 2 gives 1. Adding 2 to this, and dividing by 2 gives:
1.5.
Dividing 2 by 1.5 gives 1.333. Adding 1.5 to this and dividing by 2 again
gives:
1.416666666.
Repeating once more gives:
1.41421568,
which is close to the true value.
How does the algorithm work? Imagine that you know the true value
for the square root of two. If you divide two by this number, the result
is exactly the same value—the square root of two.
Now, imagine that your guess is greater than the square root of two.
When you divide two by this number, you obtain a value less than the
square root of two. These two numbers frame the true square root—
one is too large, the other is too small. An improved estimate can be
obtained by calculating the average of these two numbers (i.e. the sum
divided by two). This gives a value midway between the two framing
numbers.
This procedure—division and averaging—can be repeated to further
refine the estimate. Over successive iterations, the estimates converge
on the true square root.
It is worth noting that the process also works if the guess is less than
the true square root. In this case, the number obtained by means of
division is too large. Again, the two values frame the true square root.
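The division-and-averaging procedure is short enough to express in a few lines of code. This is my rendering; the starting guess and the iteration count are parameters of the sketch, not part of the ancient formulation.

```python
def heron_sqrt(n, guess=2.0, iterations=4):
    """Estimate the square root of n by repeated division and averaging."""
    estimate = guess
    for _ in range(iterations):
        # n / estimate lies on the other side of the true root, so the
        # average of the two framing numbers is a better estimate.
        estimate = (estimate + n / estimate) / 2
    return estimate
```

Three iterations from the deliberately poor guess of 2 reproduce the estimates in the text: 1.5, then 1.41666…, then 1.41421568.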
Even today, Heron’s method is used to estimate square roots. An
extended version of the algorithm was utilized by Greg Fee in 1996 to
confirm an enumeration of the square root of two to ten million digits.
Mesopotamian mathematicians went so far as to invoke the use of
memory in their algorithms. Their command ‘Keep this number in
your head’ is an antecedent of the data storage instructions available
in a modern computer.
Curiously, Babylonian algorithms do not seem to have contained
explicit decision-making (‘if–then–else’) steps. ‘If–then’ rules were,
however, used by the Babylonians to systematize non-mathematical
knowledge. The Code of Hammurabi, dating from 1754 bce, set out 282
laws by which citizens should live. Every law included a crime and a
punishment: 8
If a son strike a father, they shall cut off his fingers.
If a man destroy the eye of another man, they shall destroy his eye.
If–then constructs were also used to capture medical knowledge and
superstitions. The following omens come from the library of King
Ashurbanipal in Nineveh around 650 bce: 9
If a town is set on a hill, it will not be good for the dweller within that
town.
If a man unwittingly treads on a lizard and kills it, he will prevail over his
adversary.
Despite the dearth of decision-making steps, the Mesopotamians
solved a wide variety of problems by means of algorithms. They
inputs has to be smaller than the larger of the two numbers. Replacing
the larger number with the difference means that the pair of numbers
is reduced. In other words, the pair is getting closer to the GCD. At
all times, the pair, and their difference, are multiples of the GCD.
Over several iterations, the difference becomes smaller and smaller.
Eventually, the difference is zero. When this happens, the numbers are
equal to the smallest possible multiple of the GCD, that is, the GCD
times 1. At this point, the algorithm outputs the result and terminates.
This version of Euclid’s algorithm is iterative. In other words, it con-
tains repeating steps. Euclid’s algorithm can, alternatively, be expressed
recursively. Recursion occurs when an algorithm invokes itself. The idea is
that every time the algorithm calls itself, the inputs are simplified. Over
a number of calls, the inputs become simpler and simpler until, finally,
the answer is obvious. Recursion is a powerful construct. The recursive
version of Euclid’s algorithm operates as follows:
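Both forms of the algorithm follow directly from the description above: replace the larger number by the difference until the two numbers are equal. The sketch below is mine; the function names are illustrative.

```python
def gcd_iterative(a, b):
    """Subtraction-based Euclid: repeat until the pair becomes equal."""
    while a != b:
        if a > b:
            a = a - b     # replace the larger number by the difference
        else:
            b = b - a
    return a              # the pair has shrunk to the GCD itself

def gcd_recursive(a, b):
    """The same algorithm, expressed by having it invoke itself."""
    if a == b:            # the answer is now obvious
        return a
    if a > b:             # otherwise, simplify the inputs and recurse
        return gcd_recursive(a - b, b)
    return gcd_recursive(a, b - a)
```

Each recursive call receives a strictly smaller pair, so the calls eventually bottom out at the case where both numbers are equal.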
Finding Primes
In the third century bce, Eratosthenes (c. 276–195 bce) was appointed
Director of the Library of Alexandria. Born in Cyrene, a North African
city founded by the Greeks, Eratosthenes spent most of his early years
List the numbers that you wish to find primes among, starting
at 2.
Repeat the following steps:
Find the first number that hasn’t been circled or crossed out.
Circle it.
Cross out all multiples of this number.
Stop repeating when all the numbers are either circled or
crossed out.
The circled numbers are prime.
Imagine trying to find all of the primes up to fifteen. The first step is
to write down the numbers from 2 to 15. Next, circle 2 and cross out its
multiples: 4, 6, 8, and so on. In the lists below, circled numbers are shown
in parentheses and crossed-out numbers in square brackets:

(2), 3, [4], 5, [6], 7, [8], 9, [10], 11, [12], 13, [14], 15
Then, circle 3 and cross out all of its multiples: 6, 9, 12, 15.

(2), (3), [4], 5, [6], 7, [8], [9], [10], 11, [12], 13, [14], [15]
Four is already crossed out, so the next number to circle is 5, and so it
goes. The final list is:
2 , 3 , 4, 5 , 6, 7 , 8, 9, 10, 11 , 12, 13 , 14, 15
The numbers that pass through the sieve (that is, are circled) are prime.
One of the neat aspects of the Sieve of Eratosthenes is that it does not
use multiplication. Since the multiples are generated sequentially, one
after another, they can be produced by repeatedly adding the circled
number to a running total. For example, the multiples of 2 can be
calculated by repeatedly adding two to a running total, giving 4, 6, 8,
and so on.
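The Sieve translates almost line for line into code. In this sketch (mine, not the book's), a list of flags records which numbers have been struck off, and multiples are generated by repeated addition, just as the text notes.

```python
def sieve(limit):
    """Sieve of Eratosthenes: return all primes from 2 up to limit."""
    struck = [False] * (limit + 1)   # one flag per number: struck off?
    primes = []
    for n in range(2, limit + 1):
        if not struck[n]:            # first number not yet circled or
            primes.append(n)         # crossed out: circle it
            multiple = n + n         # cross out its multiples by
            while multiple <= limit: # repeated addition -- no
                struck[multiple] = True  # multiplication needed
                multiple += n
    return primes
```

The flag list is also where the storage cost shows up: sieving up to a limit of fifteen needs one flag for every number written down, exactly as described below.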
A drawback of the Sieve is the amount of storage that it needs. To
produce the first seven primes, eighteen numbers have to be stored. The
amount of storage can be reduced by only recording whether a number
has been struck off or not. Nevertheless, the storage complexity of the Sieve
becomes a problem for large numbers of primes. An up-to-date laptop
computer can find all primes with fewer than eight decimal digits using
the Sieve of Eratosthenes. In contrast, as of March 2018, the largest
known prime number contains a whopping 23,249,425 decimal digits.
For three hundred years, the Museum of Alexandria was a beacon
of teaching and learning. Thereafter, slow decline was punctuated by
disaster. In 48 bce, Julius Caesar’s army put their vessels to the torch
in Alexandria’s harbour in a desperate attempt to stall Ptolemy XIII’s
forces. The fire spread to the docks and parts of the Library were
damaged in the resulting conflagration. The Museum was damaged in
an Egyptian revolt in 272 ce. The Temple of Serapis was demolished in
391 ce by the order of Coptic Christian Pope Theophilus of Alexandria.
The female mathematician Hypatia was murdered in 415 ce by a Chris-
tian mob. The Library was finally destroyed when General ʿAmr ibn
al-ʿĀṣ al-Sahmī’s army took control of the city in 641 ce.
While the Museum of Alexandria was the pre-eminent centre of
learning in the ancient Greek world for six centuries, it was not the only
stronghold of logic and reason. On the other side of the Mediterranean,
a lone genius invented a clever algorithm for calculating one of the
most important numbers in all of mathematics. His algorithm was to
outshine all others for nearly a thousand years.
2
Ever-Expanding Circles
piece of string across a circle’s diameter and compare that length to the
circumference. You will find that the circle’s circumference is slightly
more than three times its diameter. Repeated measurement shows that
this ratio is constant for all circle sizes. Of course, ‘slightly more than
three times’ isn’t particularly satisfactory from a mathematical point
of view. Mathematicians want precise answers. Determining the precise
ratio of a circle’s circumference to its diameter is a never-ending quest.
The exact ratio—whatever its true value—is today represented by
the Greek letter π (pronounced ‘pi’). The letter π was first used in this
way, not by the ancient Greeks, but by a Welshman—mathematician
William Jones in 1707.
Writing down the true value of π is impossible. In the 1760s, Johann
Heinrich Lambert proved that π is an irrational number, meaning that its
enumeration requires infinitely many digits. No matter how far enumeration
goes, the digits never settle into a repeating pattern. The best anyone
can do is approximate π.
After the first few integers, π is arguably the most important number
in mathematics. Without π, we would struggle to reason about circles
and spheres. Circular motion, rotations, and vibrations would become
mathematical conundrums. The value of π is employed in many prac-
tical applications ranging from construction to communication, and
from spaceflight to quantum mechanics.
The original estimate of 3 is correct to 1 digit. By about 2,000 bce,
the Babylonians had estimated π as 25/8 = 3.125, accurate to two digits.
The Egyptian Rhind Papyrus offers an improved approximation of
256/81 = 3.16049, close to three-digit accuracy. However, the first real
Figure 2.1 A circle with an inner hexagon (left) and a circle with an
outer hexagon (right). The inner hexagon includes its constituent equilateral
triangles.
another estimate for π equal to 3.46410. This estimate is close to the true
value but is a little too large.
Archimedes improved these approximations by means of an algo-
rithm. Every iteration of the algorithm doubles the number of sides
in the two polygons. The more sides a polygon has, the better its
approximation to π.
The algorithm operates as follows:
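The book's step listing is not reproduced here, so the following is my reconstruction, using the standard modern form of the doubling step: when the number of sides doubles, the outer bound becomes the harmonic mean of the two old bounds, and the inner bound becomes the geometric mean of the old inner bound and the new outer bound.

```python
import math

def archimedes(iterations):
    """Bound pi between inscribed and circumscribed regular polygons,
    doubling the number of sides on every iteration."""
    inner = 3.0               # perimeter/diameter of the inscribed hexagon
    outer = 2 * math.sqrt(3)  # same ratio for the circumscribed hexagon
    for _ in range(iterations):
        # Double the sides: harmonic mean, then geometric mean.
        outer = 2 * inner * outer / (inner + outer)
        inner = math.sqrt(inner * outer)
    return inner, outer

low, high = archimedes(1)  # dodecagons: roughly 3.10583 and 3.21539
```

A single iteration reproduces the dodecagon estimates quoted in the text, and further iterations squeeze the bounds ever closer around π.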
In the first iteration, the algorithm turns the hexagons into dodecagons
(12-sided figures). This gives improved estimates for π of
3.10582 (inner polygon) and 3.21539 (outer polygon) to six digits.
The beauty of Archimedes’ algorithm is that it can be applied again.
The outputs from one run can be fed into the algorithm as inputs to
the next iteration. In this way, the dodecagons can be transformed
World Records
Archaeological evidence suggests that civilization emerged in China at
around the same time as it appeared in Mesopotamia and Egypt. Urban
society in China appears to have first developed along the banks of the
Yangtze and Yellow Rivers. Little is known of early Chinese mathemat-
ics as the bamboo strips used for writing at the time were perishable.
Although there was intercommunication between East and West, it
appears that Chinese mathematics developed largely independently.
The oldest extant Chinese mathematical text—Zhoubi Suanjing—dates
to around 300 bce. The book focuses on the calendar and geometry, and
includes the Pythagorean Theorem. A compendium of mathematical
problems analogous to the Rhind Papyrus—Nine Chapters on the Mathe-
matical Art—survives from roughly the same period.
It seems that the search for π was much more determined in China
than in the West. In 264 ce, Liu Hui used a ninety-six-sided inner
polygon to obtain an approximation of 3.14—accurate to three figures.
He later extended his method to a polygon of 3,072 sides, obtaining an
improved estimate of 3.14159—accurate to six figures.
Zu Chongzhi (430–501 ce), assisted by his son Zu Chengzhi, produced
an even better estimate in the fifth century ce. The father
and son duo used a polygonal method similar to Archimedes’ approach.
However, they persevered for many more iterations. Their upper
and lower bounds of 3.1415927 and 3.1415926 were accurate to seven
al-Khwārizmī’s text On the Hindu Art of Reckoning (c. 825) describes the
decimal number system, including the numerals that we employ today.
The system’s roots lie in the Indus Valley civilization (now southern
Pakistan), which blossomed around 2,600 bce—roughly contempora-
neous with the construction of the pyramids in Giza. Little is known
of the original mathematics of the region, save what can be gleaned
from religious texts. Inscriptions suggest that the nine non-zero Hindu–
Arabic numerals (1–9) appeared in the region between the third century
bce and the second century ce. Certainly, there is a clear reference to the
nine non-zero Hindu numerals in a letter from Bishop Severus Sebokht,
who lived in Mesopotamia around 650 ce. The numeral for zero (0)
finally appeared in India at around the same time.
By the eighth century, many Persian scholars had adopted the
Hindu–Arabic number system by reason of its great convenience—
hence, al-Khwārizmī’s book on the subject. His text became a conduit
for the Hindu–Arabic numerals in their transfer to the West. On the
Hindu Art of Reckoning was translated in 1126 from Arabic into Latin by
Adelard of Bath, an English natural philosopher. Adelard’s translation
was followed by Leonardo of Pisa’s (Fibonacci) book on the topic—Liber
Abaci—in 1202.
In 1258, four hundred years after al-Khwārizmī’s death, The House of
Wisdom was destroyed in the Mongol sack of Baghdad.
Surprisingly, uptake of the new number system was slow. It would be
centuries before Roman numerals (I, II, III, IV, V, …) were displaced by
the Hindu–Arabic digits. It seems that European scholars were perfectly
happy to perform calculations on an abacus and record the results
using Roman numerals. Decimal numbers only became the preferred
option in the sixteenth century with the transition to pen and paper
calculation.
It is al-Khwārizmī’s name in the title of a Latin translation of his
book—Algoritmi de Numero Indorum—that gives us the English word
‘algorithm’.
the printing press in the fifteenth century further spurred the spread
of scholarship and learning.
The ensuing Enlightenment of the eighteenth century saw a revolu-
tion in Western philosophy. Centuries of dogma were swept away by the
strictures of evidence and reason. Mathematics and science became the
foundations of thought. Technological progress altered the very fabric
of society. Democracy and the pursuit of individual liberty were in the
ascendency.
Changing attitudes, heavy taxation, and failed harvests were to ignite
the French Revolution in 1789. Amidst the bloody upheaval, a French
mathematician laid the theoretical foundations for what was to become
one of the world’s most frequently used algorithms.
In 1768, Jean-Baptiste Joseph Fourier (Figure 2.2) was born in Auxerre,
France. Orphaned at age nine, Fourier was educated in local schools
run by religious orders. The lad’s talent for mathematics became ob-
vious as he entered his teenage years. Nonetheless, the young man
undertook to train for the priesthood. On reaching adulthood, Fourier
abandoned the ministry to devote his career to mathematics, taking
up employment as a teacher. Soon, he became entangled in the po-
litical upheaval sweeping the country. Inspired by the ideals of the
Figure 2.3 Three harmonics (left) and the waveform resulting from their
summation (right). The second harmonic is scaled by a half and the third
harmonic is delayed by a quarter cycle.
Next, we double the speed of the machine again. This time, the period
is a quarter of the length of the pool. This is the third harmonic.
Another doubling and we get the fourth harmonic, and so on.
This sequence of harmonics is called the Fourier series.
Fourier’s remarkable idea was that all waveforms—of any shape
whatsoever—are the sums of scaled and delayed harmonics. Scaling
means increasing or decreasing the size of the waveform. Scaling up
makes the peaks higher and the troughs lower. Scaling down does the
converse. Delaying a waveform means shifting it in time. A delay means
that the wave’s crests and troughs arrive later than before.
Let us examine the effect of combining harmonics (Figure 2.3). Let
us say that the first harmonic has an amplitude of one. The amplitude
of a waveform is its maximum deviation from rest. The amplitude of
a harmonic is the height of the crests. The second harmonic has an
amplitude of a half. The third harmonic has an amplitude of one and
a delay of half a period. If we add these harmonics up, we get a new
compound waveform. The process of addition mimics what happens in the
real world when waves meet. They simply ride on top of one another.
The terminology used in physics is to say that the waves superpose.
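Superposition is easy to mimic in a few lines of code. The sketch below builds the compound waveform of Figure 2.3 by adding scaled and delayed harmonics sample by sample; the sample count and function names are illustrative choices, not taken from the book.

```python
import math

N = 64  # samples per fundamental period (an arbitrary choice)

def harmonic(k, amplitude=1.0, delay=0.0):
    """The k-th harmonic: k full cycles across the window, scaled by
    'amplitude' and shifted later by 'delay' periods of that harmonic."""
    return [amplitude * math.sin(2 * math.pi * (k * i / N - delay))
            for i in range(N)]

def superpose(*waves):
    """Waves simply ride on top of one another: add them sample by sample."""
    return [sum(levels) for levels in zip(*waves)]

# The compound waveform of Figure 2.3: the second harmonic scaled by a
# half, the third delayed by half a period.
compound = superpose(harmonic(1),
                     harmonic(2, amplitude=0.5),
                     harmonic(3, delay=0.5))
```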
Clearly, adding waveforms is easy. Reversing the process is much
more complicated. How, given a compound waveform, can the am-
plitudes and delays of the constituent harmonics be determined? The
answer is by means of the Fourier transform (FT).
The FT takes any waveform as input and breaks it up into its com-
ponent harmonics. Its output is the amplitude and delay of every
harmonic in the original waveform. For example, given the compound
waveform in Figure 2.3, the FT will output the amplitudes and delays of
the three constituent harmonics. The output of the FT is two sequences,
one of which lists the amplitudes of the harmonics. For the compound
waveform in Figure 2.3, the amplitudes of the constituent harmonics
are [1, ½, 1]. The first entry is the amplitude of the first harmonic, and so
on. The second output sequence is the delays of the harmonics: [0, 0, ½],
measured in periods.
While initially only of interest to physicists, the real power of the
FT became evident in the decades after the invention of the computer.
Computers allow all kinds of waveforms to be quickly and cheaply
analysed.
A computer stores a waveform as a list of numbers (Figure 2.4). Every
number indicates the level of the waveform at a particular moment
in time. Large positive values are associated with the peaks of the
waveform. Large negative numbers with the troughs. These numbers
are called samples since the computer takes a ‘sample’ of the level of the
waveform at regular intervals of time. If the samples are taken often
enough, the list of numbers gives a reasonable approximation to the
shape of the waveform.
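Sampling can be sketched directly; the waveform, duration, and rate below are arbitrary illustrative choices.

```python
import math

def sample(waveform, duration, rate):
    """Take a 'sample' of the waveform's level at regular intervals of
    1/rate seconds, giving a list of numbers a computer can store."""
    return [waveform(i / rate) for i in range(int(duration * rate))]

# One second of a 3 Hz sine wave, sampled 32 times per second.
samples = sample(lambda t: math.sin(2 * math.pi * 3 * t),
                 duration=1.0, rate=32)
```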
The FT commences by generating the first harmonic. The generated
waveform contains a single period—one crest and one trough—and is
the same length as the input sequence. The algorithm multiplies the
two lists of numbers sample-by-sample (e.g. [1, −2, 3] × [10, 11, 12] =
[10, −22, 36]). These results are then totalled. The total is the correlation
of the input and the first harmonic. The correlation is a measure of the
similarity of two waveforms. A high correlation indicates that the first
harmonic is strong in the input.
The algorithm repeats the correlation procedure for a copy of the
first harmonic delayed by a quarter of a period. This time, the correlation
measures the similarity between the input waveform and the
delayed first harmonic.
Figure 2.4 A waveform and the associated sample values that are used to
represent the signal.
The two correlation values (not delayed and delayed) are fused to
produce an estimate of the amplitude and delay of the first harmonic.
The amplitude is equal to the square root of the sum of the squares of
the two correlations, divided by half the number of samples. The delay is
obtained by calculating the relative strength of the two correlations.
The relative strength indicates how close the component is in time to
the two versions of the harmonic.
This double correlation (not delayed and delayed) and fusion process
is repeated for all higher harmonics. This gives their amplitudes and
delays.
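The double-correlation procedure can be sketched in Python. This is an illustrative reconstruction of the algorithm as described, not production code: the quarter-period-shifted copy is a cosine, and the amplitude is normalized by half the number of samples so that the recovered values match the input harmonics.

```python
import math

N = 64

def harmonic(k, amplitude=1.0, delay=0.0):
    """k-th harmonic, delayed by 'delay' periods of that harmonic."""
    return [amplitude * math.sin(2 * math.pi * (k * i / N - delay))
            for i in range(N)]

def fourier_transform(samples):
    """For each harmonic, correlate the input with the harmonic and with
    a copy shifted by a quarter period, then fuse the two correlations
    into an amplitude and a delay (measured in periods)."""
    n = len(samples)
    amplitudes, delays = [], []
    for k in range(1, n // 2):
        c1 = sum(s * math.sin(2 * math.pi * k * i / n)
                 for i, s in enumerate(samples))   # in-phase copy
        c2 = sum(s * math.cos(2 * math.pi * k * i / n)
                 for i, s in enumerate(samples))   # quarter-period shift
        amplitudes.append(math.sqrt(c1 ** 2 + c2 ** 2) / (n / 2))
        # the relative strength of the two correlations gives the delay
        delays.append(math.atan2(-c2, c1) / (2 * math.pi) % 1)
    return amplitudes, delays

# Build the compound waveform of Figure 2.3, then recover its ingredients.
compound = [a + b + c for a, b, c in zip(harmonic(1),
                                         harmonic(2, 0.5),
                                         harmonic(3, 1.0, 0.5))]
amps, dels = fourier_transform(compound)
```

Run on the Figure 2.3 compound waveform, the sketch recovers amplitudes of 1, ½, and 1 for the first three harmonics, a delay of half a period for the third, and essentially zero amplitude everywhere else.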
In summary, the FT algorithm works as follows: generate each harmonic
in turn; correlate the input with the harmonic and with a copy delayed by
a quarter period; fuse the two correlations into the harmonic's amplitude
and delay; repeat for every harmonic.
3
Computer Dreams
The most obvious and the most distinctive features of the History
of Civilization, during the last fifty years, is the wonderful increase
of industrial production by the application of machinery, the
improvement of old technical processes and the invention of
new ones.
Thomas Henry Huxley
The Advance of Science in the Last Half-Century, 1887 25
to produce cloth with different weaves. The machine wove fabric ac-
cording to the pattern of holes cut into cards. By altering the card
punch pattern, a different weave could be produced. Sequences of cards
were linked together in loops such that the machine could repeat the
programmed pattern.
Thus, by the early nineteenth century, Europe was in possession of
hand-powered calculators and steam-powered programmable looms.
An English mathematician with a childhood fascination for machines
began to wonder about combining these concepts. Surely, a steam-
powered programmable machine could perform calculations much
faster, and more reliably, than any human? His idea almost changed
the world.
A Clockwork Computer
Charles Babbage was born in Walworth, Surrey, England, in 1791
(Figure 3.1). The son of a wealthy banker, Babbage became a student
Figure 3.1 Charles Babbage (left, c. 1871) and Ada Lovelace (right, 1836),
the first programmers. (Left: Retrieved from the Library of Congress, www.loc.gov/item/
2003680395/. Right: Government Art Collection. GAC 2172, Margaret Sarah Carpenter:
(Augusta) Ada King, Countess of Lovelace (1815–1852) Mathematician; Daughter of Lord
Byron.)
represented by the positions of gears, cogs, and levers. The engine would
be capable of automatically performing a sequence of calculations,
storing, and re-using intermediate results along the way. The machine
was designed to perform a single, fixed algorithm. Hence, it lacked
programmability. Nevertheless, the design was a significant advance on
previous calculators, which required manual entry of each and every
number and operation. Babbage fabricated a small working model of
the machine. Seeing merit in Babbage’s concept, the British government
agreed to fund construction of Babbage’s Difference Engine.
Building the Engine proved challenging. Tiny inaccuracies in fab-
rication of its components made the machine unreliable. Despite re-
peated investment by the government, Babbage and his assistant Joseph
Clement only completed a portion of the machine before construction
was abandoned. In total, the British Treasury spent nearly £17,500 on
the project. No small sum, the amount was sufficient to otherwise
procure twenty-two brand new railway locomotives from Mr. Robert
Stephenson.26
Despite the failure of his Difference Engine project, Babbage was still
drawn to the idea of automated calculation. He designed a new, much
more advanced machine. The Analytic Engine was to be mechanical,
steam-driven, and decimal. It would also be fully programmable. Bor-
rowing from Jacquard’s loom, the new machine would read instruc-
tions and data from punch cards. Likewise, results would be proffered on
punched cards. The Analytic Engine was to be the first general-purpose
computer.
Once again, Babbage appealed to the government for funding. This
time, the money was not forthcoming. The Analytic Engine project
stalled.
Babbage made his only public presentation on the Analytic Engine to
a group of mathematicians and engineers in Torino, Italy. One of the
attendees—Luigi Federico Menabrea—a military engineer, made notes
and, subsequently, with the help of Babbage, published a paper on the
device. That paper was in French. Another supporter of Babbage’s—
Ada Lovelace—greatly admired the work and resolved to translate it
into English.
Ada Lovelace (born Augusta Ada Byron) was born in 1815, the
daughter of Lord and Lady Byron (Figure 3.1). Lady Byron (Anne
in England in the care of a retired army colonel and his wife. It was
several years before their mother rejoined her children in England.
The period of familial co-habitation proved short. Turing was sent to
boarding school at thirteen.
At school, Turing befriended classmate Christopher Morcom. The
two shared a deep interest in science and mathematics. They passed
notes discussing puzzles and proofs back and forth in class. Turing
came to worship the ground upon which Morcom walked. Tragically,
Morcom died from tuberculosis in 1930. Turing was deeply affected by
the loss.
In her memoirs, Turing’s mother, Ethel Sara Turing (née Stoney),
recalled her adult son with fondness: 32
He could be abstracted and dreamy, absorbed in his own thoughts, which
on occasion made him seem unsociable [ …]. There were times when his
shyness led him into extreme gaucherie.
Some did not share his mother’s sympathy and saw him as a loner. One
of his lecturers was to speculate that Turing’s isolation and his insistence
on working things out from first principles bestowed a rare freshness on
his work. Perhaps because of his brilliance, Turing did not suffer fools
gladly. He was also prone to eccentricity, practicing his lectures in front
of his teddy bear, Porgy, and chaining his mug to a radiator to prevent
theft. Turing was a rare combination: difficult to get on with, yet well-
liked by many of his peers.
Turing won a scholarship to study at the University of Cambridge
and graduated with a first-class honours degree in Mathematics. In
2. The tape can be moved left or right by a single cell, or remain still.
3. The state in memory can be updated or left unchanged.
Turing envisaged that a human programmer would write programs
for the machine to execute. The programmer would provide the pro-
gram and input data to an associate who would manually operate
the machine. With hindsight, it is easy to see that processing the in-
structions is so straightforward that the human operator could be
replaced by a mechanical or electronic device. The operator performs
the following algorithm:
replaces the symbols ‘2+2’ on the tape with the symbol ‘4’. In a modern
computer, arithmetic operations are built-in so as to increase processing
speed.
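The operator's routine is simple enough to sketch as a program: read the symbol under the head, look up the rule for the current state and symbol, write, move, update the state, and repeat. The simulator below is a minimal illustration, with a hypothetical rule table (not one of Turing's) that adds one to a number written in unary notation.

```python
def run_turing_machine(tape, rules, state='scan', head=0):
    """The operator's loop: read the symbol under the head, look up the
    rule for (state, symbol), write a symbol, move the tape, update the
    state, and repeat until the machine halts."""
    cells = dict(enumerate(tape))        # sparse tape; blank cells are ' '
    while state != 'halt':
        symbol = cells.get(head, ' ')
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += move                     # -1 left, +1 right, 0 remain still
    return ''.join(cells[i] for i in sorted(cells))

# An illustrative rule table: append a '1' to a row of '1's,
# i.e. add one in unary notation.
RULES = {
    ('scan', '1'): ('1', +1, 'scan'),    # step right over the 1s
    ('scan', ' '): ('1', 0, 'halt'),     # first blank: write a 1 and stop
}
```

For example, `run_turing_machine('111', RULES)` leaves '1111' on the tape.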
Turing proposed that his machine was flexible enough to perform
any algorithm. His proposal, which is now generally accepted, is a two-
sided coin: it defines both what an algorithm is and what a general-purpose
computer is. An algorithm is a sequence of steps that a Turing machine
can be programmed to perform. A general-purpose computer is any
machine that can execute programs equivalent to those which can be
performed by a Turing machine.
Nowadays, the mark of a general-purpose computer is that it is Turing
complete. In other words, it can mimic the operation of a Turing machine.
The paper tape symbols can, of course, be substituted by other physical
quantities, e.g. the electronic voltage levels used in modern computers.
All modern computers are Turing complete. If they weren’t Turing
complete, they wouldn’t be general-purpose computers.
An essential feature of the Turing machine is its ability to inspect
data and to make decisions about what action to perform next. It is this
capability that raises the computer above the automatic calculator. A
calculator cannot make decisions. It can process data, but not respond
to it. Decision-making capability gives computers the power to perform
algorithms.
Turing used his hypothetical machine to assist him in tackling a
classic problem in computability: ‘Can all functions be calculated using
algorithms?’ A function takes an input and produces a value as output.
Multiplication is a computable function, meaning that there is a known
algorithm for calculating the result of a multiplication for all possible
input values. The question that Turing was wrestling with was: ‘Are all
functions computable?’.
He proved that the answer is ‘no’. There are some functions that are
not computable by means of an algorithm. He demonstrated that one
particular function is not computable.
The halting problem asks whether there is an algorithm that can always
determine if another algorithm will terminate. The halting problem ex-
presses a practical difficulty in programming. By mistake, programmers
can easily write a program in which some of the steps repeat forever.
This circumstance is called an infinite loop. Typically, it is not desirable
that a program never terminates. It would be helpful for programmers
to have a checker program that would analyse a newly written program
hindered by the war. Lack of parts, limited funds, and aerial bombing
all played their part in slowing development of the Z series. By 1945,
Zuse’s work was practically at a standstill. Development of the Turing
complete Z4 computer was fatefully stalled. It would be 1949 before
Zuse re-established a company to manufacture his devices. The Z4 was
finally delivered to ETH Zurich in 1950. Zuse’s company built over 200
computers before it was acquired by German electronics conglomerate
Siemens.
In the US, the Second World War was a boon to the development of
the first computers. Inspired by a demonstration model of Babbage’s
Difference Engine, Howard Aiken designed an electromechanical computer at
Harvard University. Funded and built by IBM, the Harvard Mark I (AKA
the Automatic Sequence Controlled Calculator) was delivered in 1944.
Fully programmable and automatic, the machine could run for days
without intervention. However, it lacked decision-making capability,
and as such, the Harvard Mark I was not Turing complete and, hence,
not a general-purpose computer.
The world’s first operational digital general-purpose computer was
constructed in Pennsylvania. Bankrolled by the US Army, and with
a clear military application, the machine was unveiled to the assem-
bled press in 1946. It was the beginning of a revolution in algorithm
development.
4
Weather Forecasts
Numerical Forecasts
Born in Newcastle, England in 1881, Lewis Fry Richardson (Figure 4.1)
studied science at Newcastle University and King’s College, Cambridge.
Richardson’s book was not well received. His algorithm was wildly
inaccurate and outlandishly impractical. It necessitated a vast amount
of computation. The only way forward was with the assistance of a
high-speed calculating machine. It would be almost thirty years before
numerical weather forecasting was revisited.
ENIAC
The first operational general-purpose computer was designed and built
at the University of Pennsylvania during the Second World War. The
ENIAC (Electronic Numerical Integrator And Computer) was designed
by two of the university's engineers, John Mauchly and Presper Eckert. In a
twist of fate, most of the credit went to world-famous mathematician,
John von Neumann.
Mauchly was born in Cincinnati in 1907. Such was his ability, he was
allowed to begin his PhD studies in Physics before completing the more
basic BSc degree. On graduation, Mauchly was appointed as a Lecturer
at Ursinus College in Pennsylvania.
In 1941, Mauchly took a course on Electronic Engineering at the
Moore School in the University of Pennsylvania. Sponsored by the US
Navy, the course focused on electronics for the military. Eckert, a recent
graduate of the Moore School, was one of the instructors on the course.
Although Eckert hadn’t been an ace student, he was a superb practical
engineer. Despite Mauchly being Eckert’s senior by twelve years, the
two hit it off, bonding over a shared fascination with gadgets. After the
course, Mauchly was hired by the Moore School.
In the shadow of the Second World War, the Moore School was host
to human computers working for the US Army. These human comput-
ers were employed by the Ballistic Research Laboratory (BRL) situated
at the nearby Aberdeen Proving Ground in Maryland. The team, as-
sisted by the Moore School’s mechanical calculators, produced ballistics
tables for the artillery. The tables were used by gunnery officers on
the battlefield to determine the correct firing angles for their artillery
pieces. The tables allowed an officer to take into account the equipment
type, air pressure, wind velocity, wind direction, target range, and target
altitude. The BRL employed one hundred female graduate mathe-
maticians to perform the exacting and time-consuming calculations.
Even with that workforce, the Laboratory was unable to keep up with
demand.
Figure 4.2 The ENIAC team, 1946. Left to right: Homer Spence; J. Presper
Eckert, chief engineer; Dr John W. Mauchly, consulting engineer; Elizabeth
Jennings (aka Betty Jean Jennings Bartik); Capt. Herman H. Goldstine, liaison
officer; Ruth Lichterman.
was the only genius I ever met. The others were super-smart and great
prima donnas. But von Neumann’s mind was all-encompassing. He could
solve problems in any domain and his mind was always working, always
restless.
With a warm and friendly personality, von Neumann was well liked.
Everyone knew him as ‘Johnny’. He had the humility to listen and the
ability to absorb ideas. Given to sharp suits and fast cars, Johnny was
possessed of an earthy sense of humour. The great intellectual delighted
in people and gossip.
During the Second World War, von Neumann was granted a leave
of absence from Princeton to contribute to military projects. He was
heavily involved in the Manhattan Project, assisting in the design of
the first atomic bomb. The Project demanded a great many calcula-
tions. Von Neumann saw the need for a machine that could calculate
faster than any human. In 1944, von Neumann met Goldstine, by
chance it seems, on a train station platform in Aberdeen, Maryland.
Goldstine introduced himself. The two got to talking and, perhaps to
impress von Neumann, Goldstine mentioned his work on ENIAC. Von
Neumann’s interest was piqued. Goldstine extended an invitation and,
subsequently, von Neumann joined the ENIAC project as a consultant.
Eckert later said: 47
Von Neumann grasped what we were doing quite quickly.
In June 1945, von Neumann wrote a 101-page report entitled First Draft
of a Report on the EDVAC. The report described the new EDVAC design
in detail but neglected to mention Mauchly and Eckert, the machine’s
inventors. With Goldstine’s approval, the report was distributed to
people associated with the project. Von Neumann, sole author of the
first report on EDVAC, was widely seen as the originator of the design.
Eckert complained: 47
I didn’t know he was going to go out and more or less claim it as his own.
He not only did that, but he did it at the time when the material was
classified, and I was not allowed to go out and make speeches about it.
Completed in 1945, ENIAC arrived too late to assist in the war effort.
The giant machine was unveiled to the public on St Valentine’s Day 1946
at a press conference in the Moore School. One of the team, Arthur
Burks, gave a demonstration of ENIAC’s capabilities. He started the
show by adding 5,000 numbers together in one second. Next, Burks
explained that an artillery shell takes thirty seconds to travel from gun
Monte Carlo
Stanislaw Ulam (Figure 4.3) was born to a well-off Polish-Jewish family
in 1909. He studied mathematics, graduating with a PhD from the
Lviv Polytechnic Institute (then in Poland, now in Ukraine). In 1935, he met John von
Neumann in Warsaw. Von Neumann invited Ulam to work with him for
a few months at the IAS in Princeton. Soon after joining von Neumann,
Ulam procured a lecturing job at Harvard University. He
moved permanently to the US in 1939, narrowly avoiding the outbreak
of the Second World War in Europe. Two years later, he became a citizen
of the United States. Ulam forged a reputation as a talented mathe-
matician and, in 1943, was invited to join the Manhattan Project in Los
Alamos, New Mexico. The high-powered, collaborative environment at
Los Alamos suited Ulam. Nicholas Metropolis, a Los Alamos colleague,
later wrote of Ulam: 52
His was an informal nature; he would drop in casually, without the
usual amenities. He preferred to chat, more or less at leisure, rather
than to dissertate. Topics would range over mathematics, physics, world
events, local news, games of chance, quotes from the classics—all treated
somewhat episodically but always with a meaningful point. His was a
mind ready to provide a critical link.
In the push to build the bomb, Ulam was assigned the problem of cal-
culating the distance that neutrons (charge-free particles at the centre
of an atom) travel through a shielding material. The problem seemed
intractable. Neutron penetration depends on the particle’s trajectory
and the arrangement of atoms in the shielding material. Imagine a table
tennis ball carelessly launched at a million skittles placed at random.
How far does the ball travel, on average? There are so many possible
paths, how could anyone answer the question?
While in hospital convalescing from an illness, Ulam took to playing
Canfield Solitaire, a single-player card game. Canfield Solitaire uses the
normal deck of fifty-two playing cards. Cards are dealt one by one and
moved between piles according to the rules of the game and the player’s
Figure 4.3 Stanislaw Ulam, inventor of the Monte Carlo method, c. 1945. (By
Los Alamos National Laboratory. See Permissions.)
decisions. The goal is to end up with just four piles of cards. Each pile
should contain all of the cards from one suit.
The rules are quite simple. When a card is drawn, there is a small
number of legal moves to choose from. In most cases, selecting the best
move is straightforward.
Ulam wondered, what were his chances of winning a game? Whether
he won or lost depended on the order in which the cards were dealt.
Some sequences of cards lead to a win, others a loss. One way to
calculate the odds was to list all possible card sequences and count the
percentage that lead to wins.
Since a deck contains fifty-two cards, there are fifty-two possible first
cards (Figure 4.4). After that, there are fifty-one cards in the pack, so
there are fifty-one possible second cards. Thus, the number of possible
first and second card sequences is 52 × 51 = 2,652. Extending this
calculation to the whole pack gives 52 × 51 × 50 × 49 × · · · × 1. This is equal
to an 8 followed by sixty-seven digits. No one could possibly play that
many games.
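These counts are easy to verify by computer; the sketch below simply recomputes them.

```python
import math

pairs = 52 * 51                    # possible first-and-second card deals
orderings = math.factorial(52)     # 52 × 51 × 50 × 49 × ... × 1
print(pairs)                       # 2652
print(len(str(orderings)))         # 68 digits: an 8 followed by sixty-seven more
```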
Ulam wondered if the problem could be simplified. What if he played
just ten games? He could count the percentage of wins. That would give
him an indication of the true probability of winning. Of course, with just
ten games there is the possibility of a lucky streak. This would distort
the odds. What about one hundred games? A lucky streak that long
is far less likely. The more games played, the closer the proportion of
wins comes to the true odds.
Figure 4.4 Possible card deals in Canfield Solitaire. The circled outcomes are
sampled using the Monte Carlo method.
Ulam figured that his algorithm would work for more than just
card games. It would also work for the neutron diffusion problem.
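Ulam's sampling idea can be sketched in a few lines. The 'game' below is a hypothetical stand-in far simpler than Canfield Solitaire (win if the top card of a shuffled deck is an ace), but the method is the same: play many random games and count the fraction of wins.

```python
import random

def monte_carlo(trial, n_games=10_000, seed=1):
    """Estimate the probability that 'trial' wins by sampling:
    play n random games and count the fraction of wins."""
    rng = random.Random(seed)
    wins = sum(trial(rng) for _ in range(n_games))
    return wins / n_games

def top_card_is_ace(rng):
    """A stand-in game: shuffle a deck and 'win' if the top card is an
    ace. The true odds are 4/52, roughly 0.077."""
    deck = list(range(52))       # cards 0-3 represent the four aces
    rng.shuffle(deck)
    return deck[0] < 4

estimate = monte_carlo(top_card_is_ace)
```

With ten thousand sampled games, the estimate lands close to the true odds without enumerating all possible deals.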
Computer Forecasts
After the War, von Neumann returned to academic life at the Institute
for Advanced Study in Princeton (Figure 4.5). There, he initiated a
project to build a new electronic computer along the lines of the ED-
VAC. The IAS computer, as it became known, was to be von Neumann’s
gift to computing. The machine was operational from 1952 until 1958.
More importantly, von Neumann distributed plans for the IAS machine
to an array of research groups and corporations. The IAS machine
became the blueprint for computers all over the world.
In addition, von Neumann pondered the tasks that computers might
be put to. Perhaps as a result of his work on fluid flow in the 1930s, von
Neumann seems to have been aware of Richardson’s work on numerical
weather forecasts. Inspired, von Neumann secured a grant from the
US Navy for establishment of the first computer meteorology research
group. To kick start the initiative, von Neumann organized a conference
Figure 4.5 Mathematician John von Neumann with the IAS computer, 1952.
(Photograph by Alan Richards. Courtesy of the Shelby White and Leon Levy Archives Center at
the Institute for Advanced Study, Princeton, NJ, USA.)
Chaos
Edward Lorenz was born in Connecticut in 1917. He studied mathemat-
ics prior to serving as a meteorologist in the US Army Air Corps. He
Figure 4.6 Hurricane Florence (2018) path predictions obtained using an en-
semble method.
Long-Range Forecasts
Like ENIAC, the computers of the 1950s were expensive, power hungry,
and unreliable behemoths. The invention of the transistor and the inte-
grated circuit, in 1947 and 1958 respectively, allowed the miniaturization
of the computer.
Transistors are electronic switches. Containing no moving parts,
other than electrons, transistors are small, low power, reliable, and
incredibly fast. Groups of transistors can be wired together to create
5
Artificial Intelligence Emerges
In the 1940s and 1950s, a computer was considered to be, in essence, a fast
calculator. Due to its high cost and large size, a computer was a centralized,
shared resource. Mainframe computers churned through huge volumes
of repetitive arithmetic calculations. Human operators were employed
as gatekeepers to these new-fangled contraptions, apportioning valu-
able compute time to competing clients. Mainframes ran substantial
data processing jobs one after another with no user interaction. The
final voluminous printouts were presented in batches by the operators to
their grateful clients.
Amidst the expansion of industrial-scale arithmetic, a handful of
visionaries wondered if computers could do more. These few under-
stood that computers were fundamentally symbol manipulators. The
symbols could represent any sort of information. Furthermore, they
opined that, if the symbols were manipulated correctly, a computer
might even perform tasks which had, until that point, required human
intelligence.
departure, the team pressed on. A much simplified design, the Pilot
ACE, finally became operational in 1950.
That autumn, the group received an unusual request. Christopher
Strachey, a teacher at Harrow School, enquired if he might have a go at
programming the Pilot ACE. Strachey was undoubtedly a novice to pro-
gramming, although in 1950, everyone was a novice to programming.
Born in 1916, Strachey was the scion of a well-to-do, intellectual, En-
glish family. He graduated from King's College, Cambridge with a degree
in Physics. In his third year, he suffered a mental breakdown. Later,
his sister attributed the collapse to Strachey coming to terms with his
homosexuality. 68 During the Second World War, Strachey worked on
radar development. Thereafter, he took up employment as a teacher at
Harrow, one of the most exclusive public schools in England.
Strachey’s request was approved and he spent a day of his Christmas
vacation at the NPL, absorbing all the information that he possibly could
about the new machine. Back at Harrow, Strachey took to writing a
program for the Pilot ACE. With no machine at his disposal, Strachey
wrote the program with pen and paper and tested it by imaging the
computer’s actions. Most beginners start with a simple programming
task. Due to either ambition or naivety, Strachey embarked on writing a
program to play Checkers (Draughts in the UK). This was certainly not
an arithmetic exercise. Playing Checkers mandates logical reasoning
and foresight. In other words, playing Checkers requires intelligence.
That spring, Strachey got wind of a new computer at Manchester
University. The project was initiated by Bletchley Park alumnus Max
Newman just after the war. More powerful than the Pilot ACE, the
Manchester Baby seemed better suited to Strachey’s work. Strachey got
in touch with Turing, who by then was Deputy Director of the Manch-
ester Computing Machine Laboratory. Acquaintances since King’s Col-
lege days, Strachey managed to wheedle a copy of the programming
manual from Turing. Later that summer, Strachey visited Turing to find
out more.
A few months later, Strachey returned to test a program he had
written at Turing’s behest. Overnight, Strachey went from handwritten
notes to a working thousand-line program. The program solved the
problem that Turing had set and, on completion, played The National
Anthem on the computer’s sounder. This was the first music ever played
by a computer. Even Turing was impressed. It was clear that Strachey
was a born programmer.
Strachey was recruited by the National Research and Development
Corporation (NRDC). The NRDC’s remit was to transfer new tech-
nologies from government agencies to the private sector. The NRDC
didn't have much for Strachey to do at the time, so he continued
programming, and among other things, he invented a program to
compose love letters.
Strachey’s program took a template love letter as input and selected
the adjectives, verbs, adverbs, and nouns at random from pre-stored
lists. From whence came the ardent epistle: 72
Honey Dear
My sympathetic affection beautifully attracts your affectionate enthusi-
asm. You are my loving adoration: my breathless adoration. My fellow
feeling breathlessly hopes for your dear eagerness. My lovesick adoration
cherishes your avid ardour.
Yours wistfully
M. U. C. [Manchester University Computer]
To the bemusement of his colleagues, Strachey pinned the love letters
on the Laboratory noticeboard. While whimsical in nature, Strachey’s
program was the first glimmer of computer creativity.
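Strachey's template technique can be sketched as follows. The word lists here are illustrative (a few entries are drawn from the letter above); Strachey's actual program stored far larger lists and a more elaborate template.

```python
import random

# Illustrative word lists, loosely in the spirit of Strachey's program.
ADJECTIVES = ['sympathetic', 'affectionate', 'breathless', 'lovesick', 'avid']
NOUNS = ['affection', 'adoration', 'enthusiasm', 'ardour', 'eagerness']
ADVERBS = ['beautifully', 'breathlessly', 'wistfully']
VERBS = ['attracts', 'cherishes', 'hopes for']

def love_letter(rng=random):
    """Fill a fixed template with randomly chosen words."""
    line = (f'My {rng.choice(ADJECTIVES)} {rng.choice(NOUNS)} '
            f'{rng.choice(ADVERBS)} {rng.choice(VERBS)} '
            f'your {rng.choice(ADJECTIVES)} {rng.choice(NOUNS)}.')
    return f'Honey Dear\n{line}\nYours {rng.choice(ADVERBS)}\nM. U. C.'
```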
Strachey finally completed his Checkers program in 1952, describing
it in a paper entitled Logical or Non-Mathematical Programmes.
Checkers is a two-player board game played on the same eight-by-
eight grid as Chess. Players take opposing sides of the board and are given
twelve checkers (disks) each. One player plays white, the other black. To
begin, the checkers are placed on black squares in the three rows nearest
the player (Figure 5.2). Players take turns to move a single checker.
Figure 5.2 Checkers boards illustrating a simple play by white (left) and a later
jump which removes a black checker (right).
Checkers normally move one square diagonally in a forwards direction.
Checkers can jump over an opponent’s neighbouring checker if the
square beyond is unoccupied. A sequence of such jumps can be per-
formed in a single play. All of the opponent’s ‘jumped’ checkers are
removed from the board. The aim of the game is to eliminate all of an
opponent’s checkers. At the start, checkers can only move forwards.
When a piece reaches the far side of the board, it is ‘crowned’ by placing a
checker on top of it. Crowned pieces can be moved diagonally forwards
or backwards.
Checkers is complex. There is no simple strategy that inevitably
leads to a win. Potential plays must be evaluated by imagining how the
game will evolve. A seemingly innocuous play can have unforeseen
repercussions.
Strachey’s algorithm uses numbers to record the position of the
checkers on the board. On its own turn, the algorithm examines all
possible next plays (ten on average). In board game parlance, a move
consists of two plays, one by each player. A single play (or half-move)
is called a ply. For every possible next play, the algorithm assesses its
opponent’s potential responses. This lookahead procedure is applied up
to three moves deep. The results of the lookahead can be visualized as a
tree (Figure 5.3). Every board position is a node, or branching point, in the
tree. Every possible play from that position gives rise to a branch leading
to the next board position. The greater the lookahead, the more layers
in the tree. For the nodes at the end of the lookahead, the algorithm
counts the number of checkers that each player retains on the board.
It selects the play at the root of the tree that leads to the greatest
numerical advantage for the computer at the end of the lookahead.
Figure 5.3 Visualization of the Checkers lookahead tree. Every node is a board
position. Every branch is a play.
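The figures in the text give a sense of the tree's size. With roughly ten plays available per position, and a lookahead of three moves (six plies, since a move is two plies), the horizon of the tree holds about a million positions. A quick back-of-the-envelope check:

```python
branching = 10   # plays available per position, on average (from the text)
plies = 6        # a lookahead of three moves is six plies (half-moves)

# Each layer of the tree multiplies the position count by the branching factor.
positions = branching ** plies
print(positions)  # 1000000 horizon positions for a three-move lookahead
```

This exponential growth is why, on the Ferranti Mark I, even a modest lookahead consumed minutes of computer time.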
On the Ferranti Mark I, a commercial version of the Manchester
Mark I, every play took one to two minutes of computer time. Even
at that, Strachey’s program wasn’t a particularly good player. With
hindsight, the program’s lookahead depth was insufficient, the decision-
making logic lacked sophistication, and position evaluation was inac-
curate. Nevertheless, the hegemony of computer arithmetic had been
broken. Here was the first working example of artificial intelligence.
Strachey went on to become the University of Oxford’s first Professor
of Computer Science. Unfortunately, while a well-regarded academic,
much of Strachey’s later work remains unrecognized due to his hesi-
tancy to publish academic papers. After a short illness, he passed away
in 1975, aged 58.
Board games have become the barometers of artificial intelligence
(AI). The reasons why are both technical and very human. Board games
have clearly defined goals and rules that are amenable to computer
programming. At any point in time, there is a limited number of options
available to the computer, which makes the problem tractable. Playing
against humans is an easily understood indicator of progress. Additionally,
people love contests. A machine that might beat the world champion will
always provoke public interest. Even AI researchers crave an audience.
Machine Reasoning
Algebra is the branch of mathematics concerned with equations that
include unknown values. The unknowns are signified by letters. Using
the rules of algebra, mathematicians seek to re-arrange and combine
equations so that the values of the unknowns can be determined.
For thousands of years, manipulating equations was the domain of
mathematicians. This was something that abaci, calculators, and early
computer programs could not do. In 1946, John Mauchly, co-inventor
of ENIAC, wrote: 51
I might point out that a calculating machine doesn’t know how to do
algebra, but only arithmetic.
By the time of the Dartmouth Conference, Newell and Simon were
working at RAND Corporation. Based in Santa Monica, California,
RAND was, and still is, a not-for-profit research institute. Established
after the Second World War, RAND specializes in planning, policy, and
decision-making research for governmental bodies and corporations.
In the 1950s, RAND’s number one customer was the US Air Force.
RAND was a researchers’ paradise—intellectual freedom, smart col-
leagues, healthy budgets, and no teaching. Effectively, RAND employ-
ees were told: 45
Here's a bag of money, go off and spend it in the best interests of the Air
Force.
Figure 5.4 Designers of the Logic Theorist, Allen Newell and Herbert Simon.
(Courtesy Carnegie Mellon University.)
Simon, the elder of the pair by eleven years, was from Milwaukee.
By the 1950s, he was an established political scientist and economist. He
was a member of faculty at Carnegie Institute of Technology (CIT) in
Pittsburgh and spent summers working at RAND.
Newell grew up in San Francisco, California. He graduated with a
degree in Physics from Stanford University before dropping out of an
advanced degree in Mathematics at Princeton to join RAND.
The pair first dabbled in computers while working on projects with
the goal of enhancing organizational efficiency in air defence centres.
RAND’s computer, JOHNNIAC, was based on the IAS blueprint. John
von Neumann himself was a guest lecturer at RAND. Nonetheless, it
was a talk by Oliver Selfridge of MIT Lincoln Labs that captured Newell’s
imagination. At it, Selfridge described his work on recognizing simple
letters (Xs and Os) in images. Newell later reflected: 45
[The talk] turned my life. I mean that was a point at which I started work-
ing on artificial intelligence. Very clear—it all happened one afternoon.
Over the course of the next year, Newell and Simon developed an AI
program named Logic Theorist. Newell relocated from Santa Monica
to Pittsburgh so as to work more closely with Simon in CIT. Since CIT
didn’t have a computer, the duo tested their program by gathering a
team of students in a classroom and asking them to simulate the be-
haviour of the machine. The group ‘walked’ through programs, calling
out instructions and data updates along the way. After verification,
Simon and Newell transferred the program to Cliff Shaw in RAND
Santa Monica via teletype. Shaw entered the program into JOHNNIAC
and sent the results back to Pittsburgh for analysis.
The team declared the Logic Theorist operational on 15 December
1955. When the teaching term resumed, Simon was triumphant. He
announced: 45
Over Christmas, Allen Newell and I invented a thinking machine.
Logic Theorist performs algebra on logic equations. A logic equation re-
lates variables to one another by means of operators. Variables are denoted
by letters and can have either true or false values. The most common
logical operations are: '=' (equals), 'AND', and 'OR'. For example, if we
allocate the following meanings to the variables A, B, and W:
A = ‘Today is Saturday’
B = ‘Today is Sunday’
W = ‘Today is the weekend’
we can construct the equation:
W = A XOR B
meaning that ‘Today is the weekend’ is true if ‘Today is Saturday’ is true
OR ‘Today is Sunday’ is true, excluding the case that both are true.
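The exclusive OR described here is easy to check mechanically. A minimal sketch in Python (the variable names are illustrative, not from the text):

```python
def xor(a, b):
    # Exclusive OR: true when exactly one of the operands is true.
    return a != b

saturday = True    # A: 'Today is Saturday'
sunday = False     # B: 'Today is Sunday'
weekend = xor(saturday, sunday)   # W = A XOR B

print(weekend)     # True: it is the weekend, by exactly one of the two routes
```

Note that xor(True, True) is False: the 'excluding the case that both are true' clause is what distinguishes XOR from ordinary OR.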
By means of algebra, equations such as this can be manipulated so
as to reveal new relationships between variables. The series of manip-
ulations that lead from an initial set of equations to a conclusion is
called a proof. The idea is that if the initial set of equations is valid and
the rules of manipulation have been applied properly, the conclusion
must also be valid. The starting equations are called the premises and the
final conclusion the deduction. The proof provides formal step-by-step
evidence that the deduction is valid given the premise. For example,
given:
W = A XOR B
Three years later, Simon received the Nobel Prize for contributions to
microeconomics, his other research interest. Newell and Simon lived
the rest of their lives in Pittsburgh. Newell passed away in 1992 (aged
sixty-five) and Simon in 2001 (aged eighty-four).
Machine Learning
The ability to learn is central to human intelligence. In contrast, early
computers could only store and retrieve data. Learning is something
entirely different. Learning is the ability to improve behaviour based
on experience. A child learns to walk by copying adults and by trial-
and-error. Unsteady at first, a toddler’s co-ordination and locomotion
gradually improve until the infant becomes a proficient walker.
The first computer program to display the ability to learn was un-
veiled on public television on 24 February 1956. That program was
written by Arthur Samuel, of IBM. Samuel’s program, like Strachey’s,
played Checkers. The TV demo was so impressive that it was credited
with a fifteen-point uptick in IBM’s share price the next day.
Samuel was born in Kansas in 1901. He received a Master’s degree
in Electronic Engineering from MIT prior to taking up employment
with Bell Labs. After the Second World War, he joined the University of
Illinois as a Professor. Even though the university lacked a computer,
Samuel started work on a Checkers-playing algorithm. Three years
later, after joining IBM, Samuel finally got his hands on a real computer.
At much the same time as Strachey published his paper on Checkers,
Samuel got the first versions of his game-playing program working.
On first sight of Strachey’s paper, Samuel felt that his own work had
been scooped. On closer inspection, it was clear that Strachey’s program
was a weak Checkers player. Confident that he could do better, Samuel
pressed on.
In 1959, Samuel finally published a paper describing his new Checkers
program. The understated title—Some Studies in Machine Learning using the
Game of Checkers—belied the importance of his ideas.
Samuel’s algorithm is more thorough in evaluating positions than
Strachey’s. It achieves this by means of a clever scoring algorithm.
Points are given for various on-board features. A feature is anything that
indicates the strength, or weakness, of a position. One feature is the
difference in the number of checkers that the two players have on the
board. Another is the number of crowned pieces. Yet another is the relative
positions of checkers. Strategic elements, such as freedom to move or
control of the centre of the board, are also considered to be features.
Points are scored for every feature. The points for a given feature are
multiplied by a weight. The resulting values are totalled to give an overall
score for the position.
The weights determine the relative importance of each feature.
Weights can be positive or negative. A positive weight means that the
feature is beneficial for the computer player. A negative weight means
that the feature reduces the computer’s chances of winning. A large
weight means that a feature has a strong influence on the total score.
Multiple features with low weights can, however, combine to influence
the overall score and thus the final decision.
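The weighted scoring scheme amounts to a weighted sum over the features. A minimal sketch, with feature values and weights invented for illustration (Samuel's actual features and weights are not reproduced here):

```python
def position_score(features, weights):
    # Samuel-style evaluation: each feature's points, multiplied by its
    # weight, are totalled into an overall score for the position.
    return sum(weights[name] * points for name, points in features.items())

# Illustrative values only: a two-checker material lead, one extra crowned
# piece, and three more available moves than the opponent.
features = {'checker_advantage': 2, 'crowned_advantage': 1, 'mobility': 3}
weights = {'checker_advantage': 5.0, 'crowned_advantage': 8.0, 'mobility': 0.5}

score = position_score(features, weights)   # 2*5.0 + 1*8.0 + 3*0.5 = 19.5
```

A negative weight would simply subtract from the total, penalizing positions that exhibit the corresponding feature.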
Imagine a simple two-ply lookahead (Figure 5.6). The tree includes the
computer’s potential next plays and its opponent’s possible responses.
The scores at the leaves of the tree (ply 2) are inspected to find the
minimum in each sub-tree. This reflects the opponent’s selection of the
best play from their point of view. These minimum scores are copied
to the nodes immediately above (ply 1). This puts the scores 1, 3, 7, 5,
and 6 on the nodes in ply 1. Now, the algorithm selects the play giving
the highest score. This means that the computer chooses the best play
from its point of view. Thus, the maximum value of 7 is copied back to
the root of the tree. The play leading to the board with the score of 7 is
the best choice, provided that the opponent is a good player. This play
forces the opponent into a choice between positions with scores of 8, 7,
and 10. The best that the opponent can do is accept the position with a
score of 7.
Figure 5.6 Lookahead tree showing backtracking scores obtained using the
minimax algorithm.
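The two-ply minimax procedure can be sketched in a few lines. Only the third subtree's leaves (8, 7, and 10) appear in the text; the other leaf scores below are invented so that the subtree minima come out to the stated 1, 3, 5, and 6:

```python
def two_ply_minimax(subtrees):
    # The opponent picks the minimum leaf in each subtree (ply 2); the
    # computer then picks the subtree with the highest backed-up score (ply 1).
    backed_up = [min(leaves) for leaves in subtrees]
    best_play = max(range(len(backed_up)), key=backed_up.__getitem__)
    return best_play, backed_up[best_play]

# Leaf scores for the five candidate plays; only [8, 7, 10] is from the text.
subtrees = [[1, 4], [3, 9], [8, 7, 10], [5, 12], [6, 11]]
play, score = two_ply_minimax(subtrees)
print(play, score)   # 2 7 — the third play, with a backed-up score of 7
```

The backed-up scores are [1, 3, 7, 5, 6], and the computer takes the maximum, exactly as in the figure.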
To make effective use of the available compute time, Samuel’s pro-
gram adjusts the depth and width of the lookahead search according to
a set of rules (i.e. it uses heuristic search). When a position is unstable, for
example just before a jump, the program looks further ahead. Bad plays
are not explored in depth. Pruning the search in this way affords more
time for evaluation of likely scenarios. To further accelerate processing,
Samuel’s program stores the minimax scores for commonly occurring
board positions. These scores do not need to be recalculated during
execution, as simple table look-up suffices.
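The stored-score idea is a simple cache. A sketch, with a stand-in scoring function in place of a real minimax search (the function and its call counter are illustrative, not Samuel's code):

```python
calls = 0

def slow_score(position):
    # Stand-in for a full, expensive lookahead evaluation of a position.
    global calls
    calls += 1
    return sum(ord(c) for c in position) % 100

score_table = {}

def cached_score(position):
    # Samuel stored minimax scores for common board positions: a table
    # look-up replaces a repeated, expensive lookahead computation.
    if position not in score_table:
        score_table[position] = slow_score(position)
    return score_table[position]

first = cached_score('opening-position')
second = cached_score('opening-position')   # served from the table, no recompute
```

However often a common position recurs during play, its score is computed only once.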
In 1962, Samuel’s Checkers playing program was pitted against
Robert Nealey, a blind Checkers master. The computer’s victory was
widely hailed but Nealey wasn’t even a state champion. It would be
thirty years (1994) before a computer program finally defeated the
Checkers world champion.
The AI Winters
During the late 1950s and 1960s, expectations for AI were sky high.
Supported by Cold War military funding, AI research groups flour-
ished, most notably at MIT, CMU, Stanford University, and Edinburgh
University. In 1958, Newell and Simon predicted that within just ten
years: 83
[A] digital computer will be the world’s Chess champion, unless the rules
bar it from competition.
Four years later, Claude Shannon, founder of information theory, pro-
nounced deadpan to a television camera: 84
I confidently expect that within a matter of ten or fifteen years something
will emerge from the laboratory which is not too far from the robot of
science fiction fame.
In 1968, Marvin Minsky, Head of AI Research at MIT, predicted that: 85
[I]n thirty years we should have machines whose intelligence is compara-
ble to man’s.
Of course, none of these predictions came true.
Why were so many eminent thinkers so spectacularly wrong? The
simplest answer is hubris. These were mathematicians. To them, math-
ematics was the pinnacle of intelligence. If computers could perform
arithmetic, algebra, and logic, then surely more mundane forms of
intelligence must soon yield. What they failed to appreciate was the
variability of the real-world and the complexity of the human brain.
6
Needles in Haystacks
Figure 6.1 The Travelling Salesman Problem: Find the shortest tour that visits
every city once and returns home.
To start with, the set of all the cities, excluding the home city, is
input to the algorithm. The home city is the known start and end point
of every tour, so it does not need to be included in the search. The
algorithm creates a tree of city visits from the input set (Figure 6.2). The
algorithm relies on two mechanisms. First, it uses repetition—the algo-
rithm selects every city in the input set, one after another, as the next
to be visited. Second, it uses recursion (see Chapter 1). For each city, the
algorithm calls a copy, or instance, of itself. An instance of an algorithm is
another, separate enactment of the algorithm that operates on its own
data. In this case, every instance creates a new sub-tree in the diagram.
After each city is visited, the set of cities input to the next instance
of the algorithm is reduced. Thus, the instances deal with fewer and
fewer cities until there is just one left in the set. When this happens,
the tree's leaf instance terminates, returning a tour containing just one
city. The previous instances of the algorithm take this output and add
the selected cities in reverse order. In this way, the algorithm unwinds,
creating tours as it moves up the tree. Once all of the tours have been
traced back to the root of the tree, the original instance of the algorithm
terminates, and the completed list of tours is output.
Figure 6.2 Tree showing all possible tours. All tours end in Berlin (not shown).
As the list of tours is being generated, the length of the tours is
calculated by totalling the city-to-city distances.
The operation of the algorithm can be visualized as an animation.
The algorithm constructs the tree from the root. From there, it grows
the topmost path, one city after another, until the top leaf is reached.
It then backtracks one layer and grows the second leaf. Next, it goes back
two layers, before adding the third and fourth leaves. The algorithm
continues sweeping to and fro until the entire tree has been created. In
the end, the algorithm returns to root and terminates.
In the example, Berlin is selected as the home city and so is excluded
from the input set of {Hamburg, Frankfurt, Munich}. The first instance
of the algorithm selects Hamburg, Frankfurt, and Munich in turn as
the first city. For each of these selections, the algorithm spawns a new
instance to explore a sub-tree. After selecting Hamburg as the first city,
the second instance of the algorithm chooses Frankfurt from the set
{Frankfurt, Munich}. It then creates a third instance to deal with the
remaining city: {Munich}. Since there is only one city:
Munich
is returned as the only possible tour. The calling instance then prepends
Frankfurt, producing the tour:
Frankfurt, Munich.
The same instance then explores the alternative branch, giving the tour:
Munich, Frankfurt.
These partial tours are returned to the calling instance which adds its
selection, giving the tours:
Hamburg, Frankfurt, Munich;
Hamburg, Munich, Frankfurt.
The sub-trees starting with Frankfurt and Munich are explored in a
similar way. Finally, the complete list of tours is output and the original
algorithm instance terminated.
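The recursive enumeration just described can be sketched directly. The road distances below are invented for illustration; the text's Figure 6.1 map values are not reproduced here:

```python
# Hypothetical symmetric road distances (km) between the four cities.
DIST = {}
def link(a, b, km):
    DIST[a, b] = DIST[b, a] = km

link('Berlin', 'Hamburg', 290); link('Berlin', 'Frankfurt', 550)
link('Berlin', 'Munich', 590);  link('Hamburg', 'Frankfurt', 490)
link('Hamburg', 'Munich', 780); link('Frankfurt', 'Munich', 400)

def tours(cities):
    # Recursion as described: select each city in turn as the next visit,
    # then spawn a fresh instance on the reduced set of remaining cities.
    if len(cities) == 1:
        return [list(cities)]          # leaf instance: a one-city tour
    result = []
    for city in cities:
        rest = [c for c in cities if c != city]
        for tail in tours(rest):
            result.append([city] + tail)   # prepend on the way back up
    return result

def tour_length(tour, home='Berlin'):
    # Total the city-to-city distances, including the legs to and from home.
    path = [home] + tour + [home]
    return sum(DIST[a, b] for a, b in zip(path, path[1:]))

all_tours = tours(['Hamburg', 'Frankfurt', 'Munich'])
best = min(all_tours, key=tour_length)
print(len(all_tours), best, tour_length(best))
```

With the invented distances above, the search returns the six tours from the text and picks Berlin-Hamburg-Frankfurt-Munich-Berlin (1,770 km) as the shortest.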
An exhaustive search such as this is guaranteed to find the shortest
tour. Unfortunately, brute-force is slow. To figure out just how slow
it is, we need to think about the size of the tree. In the example, the
roadmap contains four cities. The cities are fully connected in that every
city is directly connected to all others. Thus, on leaving Berlin, there are
three possible stops. For each of these stops, there are two possible next
cities, since the salesman can't revisit a stop or go home yet.
After the first and second stops, there is only one possible third city.
Expanding this out, gives the number of possible tours as 3 × 2 × 1 = 6,
that is, 3 factorial (3!).
Computing the lengths of six tours is a manageable manual com-
putation. What happens if there are 100 cities? One hundred fully
connected cities would give ninety-nine factorial tours, approximately
9 × 10^155 (a 9 with 155 zeroes after it). A modern desktop computer
couldn’t possibly cope with that! For the Travelling Salesman Problem,
exhaustive search is surprisingly slow even for roadmaps of seemingly
moderate size.
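The factorial blow-up is easy to confirm numerically:

```python
import math

# Four fully connected cities, home excluded: 3! = 6 possible tours.
print(math.factorial(3))           # 6

# One hundred cities, home excluded: 99! possible tours.
n = math.factorial(99)
print(len(str(n)))                 # 156 digits — roughly 9 x 10^155
```

Even checking a billion tours per second, a computer would need vastly more than the age of the universe to enumerate them all.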
While more efficient algorithms have been found, the quickest isn’t
much faster than exhaustive search. The only way to significantly speed
up the search is to accept a compromise. You have to accept that the al-
gorithm might not find the shortest possible tour. To date, the best fast
approximation algorithm is only guaranteed to find a path within forty
per cent of the minimum. Of course, compromises and approximations
are not always acceptable. Sometimes, the shortest possible path must
be found.
Over the years, researchers have experimented with programs that
hunt for the shortest tours of real roadmaps. At the beginning of the
computer age (1954) the largest Travelling Salesman Problem with a
known solution contained just forty-nine cities. Fifty years later, the
largest solved tour contained 24,978 Swedish cities. The current state-
of-the-art challenge is a world map of 1,904,711 cities. The shortest tour
found on that map traverses 7,515,772,212 km. It was identified in 2013 by
Keld Helsgaun, but no one knows whether it is the shortest possible tour.
Measuring Complexity
The trouble with the Travelling Salesman Problem is the computational
complexity of the algorithm needed to solve it. Computational com-
plexity is the number of basic operations—memory accesses, additions,
or multiplications—required to perform an algorithm. The more oper-
ations an algorithm requires, the longer it takes to compute. The most
telling aspect is how the number of operations grows as the number of
elements in the input increases (Figure 6.3).
Complexity Classes
Computational problems are graded according to the complexity of
the fastest known algorithms that solve them (Figure 6.4; Table 6.1).
Problems that can be solved with polynomial time algorithms are
referred to as P problems (polynomial time). P problems are considered
quick to solve. For example, sorting is a P problem.
Short Cuts
The Travelling Salesman Problem is one of many combinatorial optimization
problems, which require that a number of fixed elements be combined
in the best way possible. In the case of the Travelling Salesman Problem,
the fixed elements are the city-to-city distances and ‘the best way possi-
ble’ is the shortest tour. The fixed elements can be arranged in a myriad
of ways. The goal is to find the single, best arrangement.
Practical combinatorial optimization problems abound. How best to
allocate staff to tasks in a large factory? What flight schedule maximizes
revenue for an airline? Which taxi should pick up the next customer
Figure 6.6 The Route Finding problem requires finding the shortest route
between two cities. This roadmap shows the distances between the principal
Dutch cities in kilometres.
moving the token to cities that lead away from the final destination.
These cities look promising as they have short links to the city marked
with the token. However, they lead in the wrong direction and are
ultimately eliminated. To remedy this flaw, Peter Hart, Nils Nilsson,
and Bertram Raphael proposed the A* (a-star) algorithm. A* utilizes a
modified distance metric. In Dijkstra’s original algorithm, the metric
is the distance travelled. In A*, the metric is the distance travelled so
far plus the straight-line distance from the current city to the final
destination. Whereas Dijkstra’s algorithm only considers the path so far,
A* estimates the length of the complete route, from start to finish. As
a result, A* is less inclined to visit cities that take the token away from
the destination.
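The modified metric can be sketched on a toy roadmap. The cities, coordinates, and road distances below are invented; each road distance is at least the straight-line distance between its endpoints, so the estimate never overshoots the true remaining distance:

```python
import heapq
import math

# Toy roadmap: city -> (x, y) position; ROADS holds road distances.
COORDS = {'A': (0, 0), 'B': (2, 0), 'C': (2, 2), 'D': (4, 1)}
ROADS = {'A': {'B': 2.0, 'C': 3.0},
         'B': {'A': 2.0, 'C': 2.0, 'D': 2.3},
         'C': {'A': 3.0, 'B': 2.0, 'D': 2.3},
         'D': {'B': 2.3, 'C': 2.3}}

def straight_line(a, b):
    (x1, y1), (x2, y2) = COORDS[a], COORDS[b]
    return math.hypot(x1 - x2, y1 - y2)

def a_star(start, goal):
    # Priority = distance travelled so far + straight-line estimate to goal.
    frontier = [(straight_line(start, goal), 0.0, start, [start])]
    visited = set()
    while frontier:
        _, travelled, city, path = heapq.heappop(frontier)
        if city == goal:
            return path, travelled
        if city in visited:
            continue
        visited.add(city)
        for neighbour, road in ROADS[city].items():
            if neighbour not in visited:
                estimate = travelled + road + straight_line(neighbour, goal)
                heapq.heappush(frontier, (estimate, travelled + road,
                                          neighbour, path + [neighbour]))
    return None, math.inf

path, length = a_star('A', 'D')
print(path, length)   # ['A', 'B', 'D'] 4.3
```

Dijkstra's original algorithm is the special case in which the straight-line estimate is dropped; A*'s extra term is what steers the search away from cities that lead in the wrong direction.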
Nowadays, variants of A* are used in every navigation app on the
planet, from sat navs to smartphones. To allow greater accuracy, cities
have since been replaced by road intersections, but the principles remain
the same. As we will see, derivatives of Dijkstra’s algorithm are now used
to route data over the Internet.
In fulfilment of van Wijngaarden’s prophecy, Dijkstra went on to
make a series of significant contributions to the precarious emerging
discipline that became computer science. Most notably, he originated
algorithms for distributed computing, whereby multiple computers coop-
erate to solve highly computationally complex problems. In recogni-
tion of his work, Dijkstra was the 1972 recipient of the ACM Turing
Award.
Stable Marriages
Roadmaps are not the only source of combinatorial optimization prob-
lems. Matching problems seek to pair items in the best way possible.
A classic matching problem is pairing hopeful college applicants with
available places. The challenge is to assign high school graduates to
courses in a way that is both fair and satisfies as many students and
colleges as possible. As with other combinatorial optimization prob-
lems, matching becomes difficult as the number of items increases. Fast
algorithms are mandatory even for moderate numbers of students.
The seminal paper on matching was published by David Gale and
Lloyd Shapley in 1962. The pair struck up a friendship at Princeton
University, New Jersey, where both studied for PhDs in Mathematics.
After Princeton, Gale joined Brown University in Rhode Island,
Central Park. For the sake of discretion, let’s call them Alex, Ben, Carlos,
Diana, Emma, and Fiona. All know one another. All are single. All
have matrimony on their minds. When asked, they state the marriage
preferences listed in Table 6.2.
In the first round, Alex proposes to Diana. Diana accepts because
she is currently available. Next, Ben proposes to the popular Diana.
Since Diana prefers Ben to Alex, she ditches the latter and accepts Ben’s
proposal. Carlos also proposes to Diana and gets turned down flat. In
the second round, Alex and Carlos are unattached. Alex proposes to
Emma, number two on his list. Emma is, as yet, without a partner,
so she acquiesces to Alex’s advances. Carlos requests Fiona’s hand in
marriage and she agrees since she is unattached. That’s it. Everyone
is now engaged—Alex & Emma, Ben & Diana, Carlos & Fiona. All of
the marriages are stable. Emma would prefer Ben but she can’t have
him since he would rather couple with Diana, his soon-to-be wife.
Similarly, Alex and Carlos have unrequited feelings for Diana but she
is set to marry the man of her dreams, Ben. Fiona is happily betrothed
to Carlos—her number-one pick—even though he, also, has a crush
on Diana.
As a dating scheme, the Gale–Shapley algorithm is brutal—all those
rejections! Nevertheless, one wonders if humans seeking partners in-
tuitively follow a procedure akin to the Gale–Shapley algorithm. In
real life, explicit proposals and firm replies become surreptitious smiles,
longing gazes, inquiries via friends, and polite refusals. While there are
similarities, there are differences. In truth, preferences evolve over time.
The emotional cost of separation means that individuals are reluctant
to make drastic changes. Despite these contrasts, some online dating
agencies now use the Gale–Shapley algorithm to match clients.
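The proposal-and-rejection rounds translate directly into code. Table 6.2 is not reproduced above, so the preference lists below are reconstructed from the narrative (Diana ranks Ben first; Fiona ranks Carlos first; the remaining orderings are plausible fillers):

```python
def gale_shapley(proposer_prefs, acceptor_prefs):
    # Each free proposer proposes to the next name on their list; an
    # acceptor trades up whenever a preferred proposer comes along.
    rank = {a: {p: i for i, p in enumerate(prefs)}
            for a, prefs in acceptor_prefs.items()}
    free = list(proposer_prefs)
    next_pick = {p: 0 for p in proposer_prefs}
    engaged = {}                      # acceptor -> proposer
    while free:
        p = free.pop(0)
        a = proposer_prefs[p][next_pick[p]]
        next_pick[p] += 1
        current = engaged.get(a)
        if current is None:
            engaged[a] = p            # a was unattached: accept
        elif rank[a][p] < rank[a][current]:
            engaged[a] = p            # a prefers p: ditch the incumbent
            free.append(current)
        else:
            free.append(p)            # turned down flat
    return {p: a for a, p in engaged.items()}

men = {'Alex': ['Diana', 'Emma', 'Fiona'],
       'Ben': ['Diana', 'Emma', 'Fiona'],
       'Carlos': ['Diana', 'Fiona', 'Emma']}
women = {'Diana': ['Ben', 'Alex', 'Carlos'],
         'Emma': ['Ben', 'Alex', 'Carlos'],
         'Fiona': ['Carlos', 'Ben', 'Alex']}

print(gale_shapley(men, women))
# {'Ben': 'Diana', 'Alex': 'Emma', 'Carlos': 'Fiona'}
```

Running it reproduces the rounds described above: Diana accepts Alex, trades up to Ben, rejects Carlos, and the algorithm settles on the three stable engagements.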
One of the biggest matching exercises in the world is carried out
every year by the National Residency Matching Program (NRMP). The
program pairs medical school graduates with internship opportunities
Artificial Evolution
In the 1960s, John Holland (Figure 6.7) took a radical approach to
solving combinatorial optimization problems. Uniquely, his algorithm
was four billion years old!
Holland was born in Fort Wayne, Indiana in 1929. Like Dijkstra, Hol-
land caught the computer programming bug while studying physics.
At MIT, he wrote a program for the Whirlwind computer. Funded by
the US Navy and Air Force, Whirlwind was the first real-time computer
to incorporate an on-screen display. The machine was designed to
process radar data and provide early warning of incoming aircraft and
missiles. After a brief sojourn programming at IBM, Holland moved to
Figure 6.7 Designer of the first genetic algorithms, John Holland. (© Santa Fe
Institute.)
7
The Internet
On 4 October 1957, the Soviet Union launched the world’s first artificial
satellite into Earth orbit. Sputnik 1 was a radio transmitter wrapped
up in a metal sphere just 60 cm in diameter. Four radio antennas were
fixed to the globe’s midriff. For twenty-one days, Sputnik broadcast a
distinctive and persistent ‘beep-beep-beep-beep’ signal. The signal was
picked up by radio receivers worldwide as the satellite passed overhead.
Sputnik was a sensation. Suddenly, space was the new frontier and the
Soviet Union had stolen a march on the West.
US President Dwight D. Eisenhower resolved that the US should
never again take second place in the race for technological supremacy.
To this end, Eisenhower established two new governmental agencies. He
charged the National Aeronautics and Space Administration (NASA)
with the exploration and peaceful exploitation of space. Its malevo-
lent sibling—the Advanced Research Projects Agency (ARPA)—was
instructed to fund the development of breakthrough military tech-
nologies. ARPA was to be the intermediary between the military and
research-performing organizations.
The Cold War beckoned.
ARPANET
Nineteen sixty-two saw the founding of ARPA’s Information Processing
Techniques Office (IPTO). The IPTO’s remit was to fund research and de-
velopment in information technology (i.e. in computers and software).
The Office’s inaugural director was JCR (Joseph Carl Robnett) Licklider
(Figure 7.1).
Licklider (born 1915) hailed from St. Louis. Universally liked, he was
known to everyone as 'Lick'. Lick never lost his Missouri accent. At
college, he studied an unusual cocktail of physics, mathematics, and
psychology. He graduated with a PhD in Psychoacoustics, the science
of the human perception of sound, from the University of Rochester.
He then worked at Harvard University before joining MIT as an As-
sociate Professor. It was at MIT that Licklider first became interested
in computers. It transpired that he had a talent, even a genius, for
technical problem solving. He carried his new-found passion into his
role at ARPA.
Licklider wrote a series of far-sighted papers proposing new computer
technologies. Man-Computer Symbiosis (1960) suggested that computers
should work more interactively with users, responding to their requests
in real-time rather than by batch print-out. He proposed the creation
of an Intergalactic Computer Network (1963) enabling integrated operation of
multiple computers over great distances. In Libraries of the Future (1965),
he argued that paper books should be replaced by electronic devices
that receive, display, and process information. He jointly published a
paper in 1968 that envisaged using networked computers as person-to-
person communication devices. In a ten-year burst of creativity, Lick-
lider predicted personal computers, the Internet, eBooks, and email. His
imagination far outpaced reality. His writings set out grand visions for
others to chase.
The first step was Project MAC (Multiple Access Computing) at MIT.
Up to that point computers had a single user. Project MAC constructed
a system whereby a single mainframe computer could be shared by
up to thirty users working simultaneously. Each user had their own
dedicated terminal consisting of a keyboard and screen. The computer
switched its attention between the users, giving each the illusion that
they had a single, but less powerful, machine at their disposal.
Two years after Licklider’s departure from ARPA, Bob Taylor was
appointed Director of the IPTO (1965). From Dallas, Taylor (born 1932)
Figure 7.1 JCR Licklider, computer network visionary. (Courtesy MIT Museum.)
crashed after the second character was received. As a result, the first
message sent on the ARPANET was the inauspicious fragment ‘LO’.
About an hour later, after a system re-start, Kline tried again. This time,
the login worked.
ARPANET was one of the first networks to employ packet-switching.
The technique was invented independently by Paul Baran and Donald
Davies. Baran, a Polish–American electrical engineer, published the idea
in 1964 while working for the RAND Corporation. Davies, a veteran of
Turing’s ACE project, developed similar ideas while working at the NPL
in London. It was Davies that coined the terms ‘packet’ and ‘packet-
switching’ to describe his algorithm. Davies was later part of the team
that built the world’s first packet-switched network—the small-scale
Mark I NPL Network in 1966.
Packet-switching (Figure 7.2) solves the problem of efficient transport
of messages across a network of computers. Imagine a network of
nine computers that are physically interconnected by means of cables
carrying electronic signals. To reduce infrastructure costs, each com-
puter is connected to a small number of others. Directly connected
computers are called neighbours, regardless of how far apart they really
are. Sending a message to a neighbour is straightforward. The message
is encoded as a series of electronic pulses transmitted via the cable to
the receiver. In contrast, sending a message to a computer on the other
side of the network is complicated. The message must be relayed by
the computers in between. Thus, the computers on the network must
cooperate to provide a network-wide communication service.
Before packet-switching, communication networks relied on dedi-
cated end-to-end connections. This approach was common in wired
telephony networks. Let’s say that computer 1 wishes to communicate
with computer 9 (Figure 7.2). In the conventional circuit-switching scheme,
the network sets up a dedicated electrical connection from computer
1 to computer 3, from 3 to 7, and from 7 to 9. For the duration of
the message exchange, all other computers are blocked from sending
messages on these links. Since computers typically send short sporadic
messages, establishing dedicated end-to-end connections in this way
makes poor use of network resources. In contrast, packet-switching
provides efficient use of network links by obviating the need for end-
to-end path reservation.
In packet-switching, a single message is broken up into segments.
Each segment is placed in a packet. The packets are independently
routed across the network. When all of the packets have been received
at the destination, a copy of the original message is assembled. The
network transfers packets from source to destination in a series of hops.
At every hop, the packet is transmitted over a single link between two
computers. This means that a packet only blocks one link at a time.
Thus, packets from different messages can be interleaved—one after
another—on a single link. There is no need to reserve an entire end-
to-end connection.
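The mechanics can be sketched in a few lines of Python. This is an illustration only, not the ARPANET's actual software; the function names and the five-character payload size are invented for the example.

```python
def to_packets(message, payload_size):
    """Split a message into numbered packets: (sequence number, payload)."""
    return [(i, message[i * payload_size:(i + 1) * payload_size])
            for i in range((len(message) + payload_size - 1) // payload_size)]

def reassemble(packets):
    """Rebuild the original message, whatever order the packets arrived in."""
    return "".join(payload for _, payload in sorted(packets))

packets = to_packets("LOGIN REQUEST FROM UCLA", 5)
packets.reverse()                      # packets may take different routes
assert reassemble(packets) == "LOGIN REQUEST FROM UCLA"
```

Because each packet carries its own sequence number, the receiver can accept packets in any order and interleave traffic from many messages on one link, with no end-to-end reservation.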
The downside is that links can become congested. This happens when
incoming packets destined for a busy outgoing link must be delayed and
queued.
A packet-switched data network is similar to a road network. The
packets comprising a single message are akin to a group of cars making
their way to a destination. The cars don’t need the entire path to be
reserved in advance. They simply slot in when the road is free. They may
even take different paths to the destination. Each car wends its own way
through the network as best it can.
Figure 7.3 The routing table for computer 3 listing the packet destination, the
delay via computers 1, 2, 4, 7, and 8, and the ID for the neighbour giving the
shortest route from computer 3 to the destination. Delay is measured in hops.
Internetworking
The first public demonstration of the ARPANET took place at the
International Computer Communication Conference (ICCC) in Wash-
ington in October 1972. Organized by Bob (Robert) Kahn of BBN,
the demo connected more than forty computers. Hundreds of previously
sceptical industry insiders saw the demo and were impressed. Packet-
switching suddenly seemed like a good idea after all.
Kahn was from New York (born 1938). He graduated with MA and
PhD degrees from Princeton before joining BBN. Shortly after the ICCC
demo, he moved to the IPTO to oversee further development of the
network. Kahn reckoned that the next big challenge was not adding
more computers to the ARPANET, but connecting the ARPANET to
other networks.
The ARPANET was all a bit samey—the nodes and links used similar
fixed line technologies. Kahn imagined a hyperconnected world in
which messages would travel transparently between devices on all kinds
of networks—wired, radio, satellite, international, mobile, fixed, fast,
slow, simple, and complex. This was a glorious concept, to be sure.
The question was how to make it work. Kahn came up with a concept
he called open-architecture networking. The approach seemed promising but
the devil was in the details. In 1973, Kahn visited the Stanford lab of
ARPANET researcher Vint Cerf and announced:
I have a problem.
Vint (Vinton) Cerf was born in New Haven, Connecticut in 1943. He
graduated from Stanford University with a BS degree in Mathematics before
joining IBM. A few years later, he chose to enrol in graduate school at
UCLA. It was at UCLA that Cerf began to work on the ARPANET. Here
too he met Kahn for the first time. Shortly after Kahn’s ICCC demo,
Cerf returned to Stanford as a professor.
Working together, Kahn and Cerf (Figure 7.4) published an outline
solution to the inter-networking problem in 1974. In it, they proposed
an overarching protocol that would be common to all networked com-
puters. The protocol defined a set of messages and associated behaviours
that would allow computers to communicate across technologically
Figure 7.4 Inventors of TCP/IP: Vint Cerf (left, 2008) and Robert Kahn (right,
2013). (Left: Courtesy Vint Cerf. Right: By Вени Марковски | Veni Markovski -
Own work / CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=
26207416.)
Fixing Errors
Communication systems such as the Internet are designed to transfer
exact copies of messages from the transmitter to the receiver. To achieve
this, the data in the packet is converted to an electronic signal that
is sent from the transmitter to the receiver. The destination device
converts the received signal back into data. Often, the received signal
is contaminated by electronic noise. Noise is the general term for any
unwanted signal that corrupts the intended signal. Noise can arise from
natural sources or interference from nearby electronic equipment. If
the noise is strong enough relative to the signal, it can lead to errors
in conversion of the signal back into data. Obviously, errors are not
desirable. An error in a sentence might be tolerated. However, an error
in your bank balance would not be, unless it happened to be in your
favour! For this reason, communications systems incorporate error
detection and correction algorithms.
6 5 12 12 15 - 52.
This time, the calculated checksum (50) does not match the checksum
at the end of the packet (52). The receiver knows that an error has
occurred.
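In Python, the receiver's check is a single comparison. Assuming the letters are numbered a = 1 to z = 26, the intact packet would read 8 5 12 12 15 - 52 (the word 'hello'); the corrupted first number makes the sums disagree. This is a sketch of a basic checksum only; real protocols use more elaborate variants.

```python
def checksum(numbers):
    """A basic checksum: just the sum of the numbers in the packet."""
    return sum(numbers)

assert checksum([8, 5, 12, 12, 15]) == 52      # intact packet matches
assert checksum([6, 5, 12, 12, 15]) == 50      # corrupted packet does not
```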
Checksums are common. For example, the International Standard
Book Number (ISBN) at the front of this book contains a checksum. All
books printed after 1970 have an ISBN that uniquely identifies the title.
Current ISBNs are thirteen digits long and the last digit is a check digit.
The check digit allows computers to verify that the ISBN has been typed
in or scanned correctly.
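The ISBN-13 check digit follows a fixed recipe: weight the first twelve digits alternately by 1 and 3, sum them, and choose the thirteenth digit to round the total up to a multiple of ten. A sketch:

```python
def isbn13_check_digit(first_twelve):
    """ISBN-13 check digit: weight digits 1, 3, 1, 3, ..., sum, and
    round the total up to the next multiple of ten."""
    total = sum(d * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(first_twelve))
    return (10 - total % 10) % 10

# This book's ISBN, 978-0-19-885373-2, ends in its check digit:
assert isbn13_check_digit([9, 7, 8, 0, 1, 9, 8, 8, 5, 3, 7, 3]) == 2
```

A scanner recomputes this digit and compares it with the thirteenth; a mismatch means the number was misread or mistyped.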
A basic checksum merely detects errors. It is impossible to work out
which number is incorrect. The error could even be in the checksum it-
self. Ironically, in this case, the message itself is correct. Basic checksums
require retransmission of the message to fix an error.
Richard Hamming (Figure 7.5) wondered if checksums could be
modified to provide error correction, as well as detection.
Born in Chicago in 1915, Hamming studied Mathematics, ultimately
attaining the PhD degree from the University of Illinois. As the Sec-
ond World War ended, Hamming joined the Manhattan Project in Los
Alamos. He worked, as he put it, as a ‘computer janitor’, running
calculations on IBM programmable calculators for the nuclear
physicists. Disillusioned, he moved to Bell Telephone Laboratories in New
Jersey. Bell Labs was the research wing of the burgeoning Bell Telephone
Company founded by Alexander Graham Bell, the inventor of the
telephone. In the late 1940s and 1950s, Bell Labs employed a stellar cast
of communications researchers. Hamming was in his element.
As with checksums, the parity bit is sent along with the data word, that
is, the sequence of data bits. To check for errors, the receiver simply
counts the number of 1s. If the final count is even, then it can be as-
sumed that no error occurred. If the count is odd, then, in all likelihood,
one of the bits suffered an error. A 0 has been mistakenly flipped to a 1,
or vice versa. For example, an error in bit two gives the word:
0 0 0 0 0 1 - 0.
This time, there is an odd number of 1 bits, indicating that an error has
occurred.
In this way, a single parity bit allows detection of a single error. If two
bits are in error, then the packet appears valid, but is not. For example:
1 0 0 0 0 1 - 0
seems to be correct since there is an even number of 1s. As a conse-
quence, additional parity bits are needed when there is a high error rate.
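A single parity bit, and its blind spot for double errors, can be sketched directly. The function names are invented for the illustration.

```python
def add_parity(data_bits):
    """Append a parity bit so the whole word has an even number of 1s."""
    return data_bits + [sum(data_bits) % 2]

def looks_valid(word):
    """The receiver's check: is the count of 1s still even?"""
    return sum(word) % 2 == 0

word = add_parity([0, 1, 0, 0, 0, 1])
assert looks_valid(word)
word[1] ^= 1                 # one flipped bit is detected...
assert not looks_valid(word)
word[0] ^= 1                 # ...but a second flip restores even parity
assert looks_valid(word)
```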
Hamming devised a clever way to use multiple parity bits to detect
and correct single bit errors. In Hamming’s scheme, every parity bit
protects half of the bits in the word. The trick is that no two parity bits
protect the same data bits. In this way, every data bit is protected by a
unique combination of parity bits. Hence, if an error occurs, its location
can be determined by looking at which parity bits are affected. There can
only be one data bit protected by all of the parity bits showing errors.
Let’s say that a data word containing eleven data bits is to be
transmitted:
1 0 1 0 1 0 1 0 1 0 1.
In Hamming’s scheme, eleven data bits require four parity bits. The
parity bits, whose value is to be determined, are inserted into the data
word at positions which are powers of two (1, 2, 4, and 8). Thus, the
protected word becomes:
? ? 1 ? 0 1 0 ? 1 0 1 0 1 0 1,
where the question marks indicate the future positions of the parity bits.
The first parity bit is calculated over the bits at odd numbered positions
(numbers 1, 3, 5, etc.). As before, the value of the parity bit is selected to
ensure that there is an even number of 1s within the group. Thus, the
first parity bit is set to 1:
[1] ? [1] ? [0] 1 [0] ? [1] 0 [1] 0 [1] 0 [1].
The brackets mark the bits in the parity group. The second parity bit is
calculated over the bits whose positions, when written in binary, have a
1 in the twos column (2, 3, 6, 7, etc.):
1 [0] [1] ? 0 [1] [0] ? 1 [0] [1] 0 1 [0] [1].
The third parity bit is calculated over bits whose positions, in binary,
have a 1 in the fours column (4, 5, 6, 7, 12, etc.):
1 0 1 [1] [0] [1] [0] ? 1 0 1 [0] [1] [0] [1].
The fourth parity bit is calculated over bits whose positions, in binary,
have a 1 in the eights column (8, 9, 10, 11, etc.):
1 0 1 1 0 1 0 [0] [1] [0] [1] [0] [1] [0] [1].
This then is the final protected data word, ready for transmission.
Now, imagine that the protected data word suffers an error at bit
position three:
1 0 0 1 0 1 0 0 1 0 1 0 1 0 1.
The receiver checks the word by counting the number of ones in the
four parity groups:
[1] 0 [0] 1 [0] 1 [0] 0 [1] 0 [1] 0 [1] 0 [1] = 5 ones;
1 [0] [0] 1 0 [1] [0] 0 1 [0] [1] 0 1 [0] [1] = 3 ones;
1 0 0 [1] [0] [1] [0] 0 1 0 1 [0] [1] [0] [1] = 4 ones;
1 0 0 1 0 1 0 [0] [1] [0] [1] [0] [1] [0] [1] = 4 ones.
The first and second parity groups both show errors (i.e. they have an
odd number of ones). In contrast, the third and fourth groups do not
indicate errors (i.e. they have an even number of ones). The only data
bit that is in the first and second groups and is not in the third and fourth
groups is bit three. Therefore, the error must be in bit number three.
The error is easily corrected by flipping its value from 0 to 1.
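The whole encode, check, and correct cycle fits in a short Python sketch. It relies on a standard property of Hamming codes that the worked example also exploits: adding up the positions of the failing parity groups (1 + 2 here) gives the position of the corrupted bit directly. The function names are invented for the illustration.

```python
def hamming_encode(data_bits):
    """Place 11 data bits at the non-power-of-two positions of a 15-bit
    word, then set parity bits 1, 2, 4, 8 so each group has even parity."""
    word = [0] * 16                              # index 0 unused
    data = iter(data_bits)
    for pos in range(1, 16):
        if pos not in (1, 2, 4, 8):
            word[pos] = next(data)
    for p in (1, 2, 4, 8):
        word[p] = sum(word[pos] for pos in range(1, 16)
                      if pos & p and pos != p) % 2
    return word[1:]

def hamming_correct(received):
    """Return (corrected word, error position); position 0 means no error."""
    bits = [0] + list(received)
    error_pos = sum(p for p in (1, 2, 4, 8)
                    if sum(bits[pos] for pos in range(1, 16) if pos & p) % 2)
    if error_pos:
        bits[error_pos] ^= 1
    return bits[1:], error_pos

sent = hamming_encode([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
assert sent == [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
received = sent.copy()
received[2] ^= 1                                 # corrupt bit three
corrected, pos = hamming_correct(received)
assert pos == 3 and corrected == sent
```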
Hamming’s ingenious algorithm allows detection and correction of
single errors at the cost of a small increase in the total number of bits
sent. In the example, four parity bits protect eleven data bits—just a
thirty-six per cent increase in the number of bits. Hamming codes are
remarkably simple to generate and check. This makes them ideal for
high-speed processing, as required in computer networks, memory, and
storage systems. Modern communication networks employ a mixture
of Hamming codes, basic checksums, and newer, more complex, error-
correcting codes to ensure the integrity of data transfer. A mistake in
your bank balance is extremely unlikely.
After fifteen years in Bell Labs, Hamming returned to teaching,
taking up a position at the Naval Postgraduate School in Monterey,
California. Hamming received the Turing Award in 1968 for his codes
and other work on numerical analysis. He died in 1998 in Monterey, just
one month after finally retiring.
One of the great flaws of the Internet is that it was not designed
with security in mind. Security has had to be grafted on afterwards,
with mixed results. One of the difficulties is that packets can be easily
read en route by eavesdroppers using electronic devices. Encryption cir-
cumvents eavesdropping by altering a message in such a way that only
the intended recipient can recover the original text. An eavesdropper
might still intercept the altered text, but the scrambled message will be
meaningless.
Until the end of the twentieth century, encryption algorithms were
intended for scenarios in which the encryption method could be agreed
in absolute secrecy before communication took place. One imagines a
queen furtively passing a top-secret codebook to a spy at a clandestine
rendezvous. However, this approach doesn’t translate well to computer
networks. How can two computers secretly exchange a codebook when
all data must be sent over a vulnerable public network? At first, it
seemed that encryption wasn’t at all practical in the brave new world
of the computer network.
Secret Messages
Encryption was used in ancient Mesopotamia, Egypt, Greece, and India.
In most instances, the motivation was secure transmission of military
or political secrets. Julius Caesar employed encryption for important
personal letters. The Caesar Cipher replaces every letter in the original
text with a substitute letter. The substitute letter is a fixed number
of places away from the original in the alphabet. To make patterns
more difficult to spot, Caesar’s Cipher strips away spaces and changes all
letters to capitals. For example, a right shift by one place in the alphabet
leads to the following encryption:
Hail Caesar
IBJMDBFTBS.
The As become Bs, the Es change to Fs, and so on. Any Zs would be
replaced by As since the shift wraps around the end of the alphabet.
The encrypted message—the ciphertext—is sent to the receiver. The
receiver recovers the original message—the plaintext—by shifting every
letter one place left in the alphabet. The Bs become As and so on
returning the original ‘HAILCAESAR’ message. Thanks to the patterns
in natural language, the missing spaces are surprisingly easy to infer.
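The cipher is a few lines of Python once the text is cleaned up. This sketch handles both directions, since decryption is simply a shift the opposite way.

```python
def caesar(text, shift):
    """Strip spaces and punctuation, capitalize, and shift each letter,
    wrapping around the end of the alphabet."""
    letters = (c for c in text.upper() if c.isalpha())
    return "".join(chr((ord(c) - ord("A") + shift) % 26 + ord("A"))
                   for c in letters)

assert caesar("Hail Caesar", 1) == "IBJMDBFTBS"
assert caesar("IBJMDBFTBS", -1) == "HAILCAESAR"
```

The `% 26` is what makes Zs wrap around to become As.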
Traditional encryption methods, such as the Caesar Cipher, rely on
an algorithm and a secret key. The key is a piece of information that is
essential for successful encryption and decryption. In the Caesar Cipher,
the key is the shift. The algorithm and the key must be known to the
sender and the intended recipient. Typically, security is maintained by
keeping the key secret.
No encryption scheme is perfect. Given sufficient time and a clever
attack, most codes can be broken. The Caesar Cipher can be attacked by
means of frequency analysis. An attacker counts the number of times
that each letter of the alphabet occurs in the ciphertext. The most
commonly occurring is probably the substitute for the vowel E, since
E is the most common letter in the English language. Once a single
shift is known, the entire message can be decrypted. Almost all codes
have vulnerabilities. The question is, ‘How long does the attack take to
perform?’ If the attack takes an unacceptably long period of time, then
the cipher is secure from a practical point of view.
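A minimal sketch of the attack on a shift cipher: assume the most frequent ciphertext letter stands in for E and read off the shift. It only works when the text is long enough for the letter statistics to assert themselves; the sample sentence here is invented for the illustration.

```python
from collections import Counter

def guess_shift(ciphertext):
    """Frequency analysis for a shift cipher: the most common letter
    is assumed to be the substitute for E."""
    top_letter = Counter(ciphertext).most_common(1)[0][0]
    return (ord(top_letter) - ord("E")) % 26

plain = "MEETMEWHEREWESEETHETREES"
cipher = "".join(chr((ord(c) - ord("A") + 3) % 26 + ord("A")) for c in plain)
assert guess_shift(cipher) == 3
```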
In computer networks, key distribution is problematic. The only
convenient way to pass a key is via the network. However, the network is
not secure against eavesdropping. Sending a secret key via the Internet
is equivalent to making it public. How could a sender and a receiver
possibly agree on a secure key if all they can do is send public messages?
The question became known as the Key Distribution Problem. The first
glimmer of a solution came from a group working at Stanford in the
early 1970s.
Martin Hellman was born in New York in 1945. He studied Electrical
Engineering at New York University before moving to California to
study for his MSc and PhD degrees at Stanford. After stints at IBM and
MIT, Hellman returned to Stanford in 1971 as an Assistant Professor.
Against the advice of his peers, Hellman started to work on the Key
Distribution Problem. Most thought it foolhardy to expect to find
something radically new—something that the well-resourced US Na-
tional Security Agency (NSA) had missed. Hellman was unperturbed.
He wanted to do things differently to everyone else. In 1974, Hellman
was joined in the hunt for a solution by Whitfield Diffie.
From Washington, DC, Diffie (born 1944) held a degree in Mathe-
matics from MIT. After graduation, Diffie worked programming jobs
at MITRE Corporation and his alma mater. However, he was capti-
vated by cryptography. He struck out to conduct his own independent
researches on key distribution. On a visit to IBM’s Thomas J. Watson
Laboratory in upstate New York, he heard about a guy called Hellman
who was working on similar stuff at Stanford. Diffie drove 5,000 miles
across the US to meet the man who shared his passion. A half-hour
afternoon meet-up extended long into the night. A bond was formed.
The duo was joined by PhD student Ralph Merkle. Born in 1952,
Merkle had previously come up with an innovative approach to the
Key Distribution Problem while studying as an undergraduate at the
University of California, Berkeley.
In 1976, Diffie and Hellman published a paper describing one of the
first practical algorithms for public key exchange. The paper was to
revolutionize cryptography. The myth that all keys had to be private
was shattered. A new form of coding was born: public key cryptography.
The Diffie–Hellman–Merkle key exchange scheme showed that two
parties could establish a secret key by means of public messages. There
was a hitch, though. Their method required the exchange and process-
ing of multiple messages. As a result, the algorithm was not ideal for use
on networks. However, their paper did suggest an alternative.
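The heart of the published key-exchange scheme is modular exponentiation. In this sketch, the shared parameters g and p and the secrets a and b are tiny illustrative numbers; real deployments use values hundreds of digits long.

```python
def public_value(g, p, secret):
    """Each party publishes g raised to its secret, modulo p."""
    return pow(g, secret, p)

def shared_key(other_public, secret, p):
    """Both sides arrive at g^(a*b) mod p without ever sending a or b."""
    return pow(other_public, secret, p)

g, p = 5, 23
a, b = 6, 15                       # private secrets, never transmitted
A = public_value(g, p, a)          # exchanged over the open network
B = public_value(g, p, b)
assert shared_key(B, a, p) == shared_key(A, b, p)
```

An eavesdropper sees g, p, A, and B, but recovering a or b from them is the discrete logarithm problem, which is believed to be intractable for large p.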
Traditional encryption algorithms use a symmetric key, meaning that
the same key is used for encryption and decryption. The drawback with
symmetric encryption is that the key must be kept secret at all times.
This requirement creates the Key Distribution Problem.
In contrast, public key encryption uses two keys: an encryption key
and a different—asymmetric—decryption key. The pair of keys must meet
two requirements. First, they must work successfully as an encryption–
decryption pair, i.e. encryption with one and decryption with the other
must return a copy of the original message. Second, it must be impossi-
ble to determine the decryption key from the encryption key. Therein
lies the beauty of public key cryptography. If the decryption key cannot
be determined from the encryption key, then the encryption key can be
made public. Only the decryption key needs to be kept secret. Anyone
can use the public encryption key to send a secret message to the private
key holder. Only the recipient who possesses the private decryption key
can decipher and read the message.
Imagine that Alice wants to be able to receive encrypted messages
(Figure 7.6). She creates an asymmetric key pair by means of a key
generation algorithm. She keeps the decryption key to herself. She
publicly releases the encryption key on the Internet. Let’s say that Bob
wants to send a secret message to Alice. He obtains Alice’s encryption
key from her Internet posting. He encrypts the message using Alice’s
encryption key and sends the resulting ciphertext to Alice. On receipt,
Alice decrypts the ciphertext using her private decryption key. In short:
and colleagues, Adi Shamir and Leonard Adleman, to help him in the
search. All three held Bachelor’s degrees in Mathematics and PhDs in
Computer Science. Rivest (born 1947) hailed from New York state, Adi
Shamir (1952) was a native of Tel Aviv, Israel, and Adleman (1945) grew
up in San Francisco. The impromptu team spent a year generating
promising ideas for one-way functions only to discard each and every
one. None were truly one-way. Perhaps there was no such thing as a
one-way function.
The trio spent Passover 1977 as guests at a friend’s house. When Rivest
returned home, he was unable to sleep. Restless, his mind turned to the
one-way encryption problem. After a while, he hit upon a new function
that might just work. He wrote the whole thing down before dawn.
The next day, he asked Adleman to find a flaw in his scheme—just
as Adleman had done for every other suggestion. Curiously, Adleman
couldn’t find a weakness. The method seemed to be robust to attack.
They had found a one-way function. Rivest, Shamir, and Adleman’s
algorithm for key generation was published later that same year. It
quickly became known as the RSA algorithm from its inventors’ initials.
RSA is now the cornerstone of encryption on the Internet.
To encipher, raise the message to the power of the encryption exponent
and take the remainder after dividing by the modulus. To decipher, apply
the same operation to the ciphertext using the decryption exponent.
Let’s say that the encryption key is (33, 7), the message is 4, and the
decryption key is (33, 3). Calculating 4 to the power of 7 (that is, seven
4s multiplied together) gives 16,384. The remainder after dividing
16,384 by 33 is 16. So, 16 is the ciphertext.
To decipher, calculate 16 to the power of 3 giving 4,096. The remain-
der after dividing 4,096 by 33 is 4. The output, 4, is the original message.
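Both operations are the same computation, raise to a power and take the remainder, so one Python function covers enciphering and deciphering. The key pair (33, 7) and (33, 3) is the one from the example; the function name is invented for the illustration.

```python
def rsa_apply(key, value):
    """Raise the value to the key's exponent, keeping the remainder
    after division by the modulus."""
    modulus, exponent = key
    return pow(value, exponent, modulus)

encryption_key, decryption_key = (33, 7), (33, 3)
ciphertext = rsa_apply(encryption_key, 4)
assert ciphertext == 16                        # 4^7 = 16,384; mod 33 is 16
assert rsa_apply(decryption_key, ciphertext) == 4
```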
How does the scheme work? The process hinges on clock arithmetic.
You may have come across the number line—an imaginary line with the
integers marked off evenly along it, much like a ruler. Starting at zero,
the number line stretches off into infinity. Now imagine that there are
only 33 numbers on the line (0–32). Roll up this shortened line into a
circle. The circle looks much like the face of an old-fashioned wall clock
marked with the numbers 0 to 32.
Imagine starting at zero and counting around the clock. Eventually,
you come to 32 and after that the count goes back to 0, 1, 2 and so on.
You keep going around and around the clock face.
Clock arithmetic mirrors the effect of the remainder operation. Di-
viding 34 by 33 gives a remainder of 1. This is the same as going around
the clock once and taking one more step.
In the example, encryption moves the clock hand 16,384 steps around
the clock face. In the end, the clock hand is left pointing to 16: the
ciphertext. Decryption starts at 0 and moves the clock hand 4,096 steps
around the clock face. In the end, the clock points to 4: the original
message.
The encryption and decryption keys are complementary. The key
pair is especially selected so that one exponent undoes the effect of the
other. The number of complete rotations around the clock face does
not matter. In the end, all that matters is the digit that the hand is
pointing to.
The key pair is produced by means of the RSA key generation al-
gorithm. This is the heart of the RSA encryption. The first two steps
contain the one-way function:
Attacking the key pair boils down to finding the two prime numbers
that were multiplied together to give the modulus. Multiplication
disguises the selected primes. For large numbers, there are many pairs
of prime numbers that might have been multiplied together to give the
modulus. An attacker would have to test a huge number of primes to
crack the code. When a large modulus is used, brute-force search for the
original prime numbers is prohibitively time-consuming.
The other steps in the key generation algorithm ensure that the
encryption and decryption are reciprocal. That is, decryption undoes
encryption for all values between zero and the modulus.
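The key generation steps can be sketched as follows. Multiplying the two secret primes is the one-way step; the decryption exponent is the modular inverse of the encryption exponent. The primes 3 and 11 reproduce the (33, 7) and (33, 3) pair from the example, whereas real keys use primes hundreds of digits long. (The three-argument `pow(e, -1, phi)` for modular inverses needs Python 3.8 or later.)

```python
from math import gcd

def rsa_keygen(p, q, e):
    """Sketch of RSA key generation from two secret primes p and q
    and a chosen encryption exponent e."""
    modulus = p * q                       # one-way step: easy to multiply,
    phi = (p - 1) * (q - 1)               # hard to factor
    assert gcd(e, phi) == 1, "e must share no factor with phi"
    d = pow(e, -1, phi)                   # decryption exponent
    return (modulus, e), (modulus, d)     # public key, private key

public, private = rsa_keygen(3, 11, 7)
assert public == (33, 7) and private == (33, 3)
```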
A Scientific American article introduced the RSA algorithm to a general
readership in 1977. The item concluded with a $100 challenge to crack
an RSA ciphertext given the encryption key. The modulus was reason-
ably large—129 decimal digits. It took seventeen years to crack the code.
The winning team comprised six hundred volunteers leveraging spare
time on computers all over the world. The plaintext turned out to be
the distinctly uninspiring:
The Magic Words are Squeamish Ossifrage
An ossifrage is a bearded vulture. The $100 prize worked out at 16 cents
per person. The point was proven. RSA is military-grade encryption.
Public key encryption is now built into the Secure Sockets Layer of
the World Wide Web. When a web site address is preceded by ‘https:’,
your computer is using SSL and, with it, the RSA algorithm to
communicate with the remote server. At last count, seventy per cent of
web site traffic used SSL. Over the years, the length of keys has had to
increase to prevent ciphertexts being cracked by the latest supercom-
puters. Today, keys containing 2,048 or more bits (617 decimal digits)
are commonplace.
The academics had proven the naysayers wrong. Governmental elec-
tronic espionage agencies could be beaten at their own game. At least,
that’s how it seemed.
Amidst the hullabaloo over RSA, the Head of the US National Se-
curity Agency stated publicly that the intelligence community had
known about public key encryption all along. This raised eyebrows.
Was his claim factual, bravado, or just plain bluff? Diffie’s curiosity was
piqued. He made discreet enquiries and was directed to look across the
Atlantic towards GCHQ. Government Communication Headquarters
is the UK’s electronic intelligence and security agency. During the
Second World War, it was the body that oversaw the code-breaking work
at Bletchley Park. Diffie wrangled two names from his contacts: Clifford
Cocks and James Ellis. In 1982, Diffie arranged to meet Ellis in a pub in
Cheltenham. A GCHQ stalwart to the last, the only hint that Ellis gave
was the cryptic:
You did a lot more with it than we did.
In 1997, the truth came out. GCHQ published the papers of Cocks,
Ellis, and another actor, Malcolm Williamson. Among the documents
was a history of events at GCHQ written by Ellis. In it, he refers to Diffie–
Hellman–Merkle’s ‘rediscovery’ of public key cryptography.
In the early 1970s, James Ellis hit upon the idea of public key cryp-
tography. His name for the technique was ‘non-secret encryption’. An
engineer by profession, Ellis couldn’t come up with a satisfactory one-
way function. Clifford Cocks happened to hear about Ellis’s discovery
over a cup of tea with Nick Patterson. Cocks, an Oxford and Cambridge
educated mathematician, found himself at a loose end that evening. He
decided to study the one-way encryption problem. Spectacularly, Cocks
solved the problem that evening. Just like the RSA team, he settled on
multiplication of two large prime numbers. This was four years ahead of
Rivest, Shamir, and Adleman. Cocks circulated the idea in the form of
an internal GCHQ paper. Malcolm Williamson picked up on the memo
and added a missing piece on key exchange some months later.
The codebreakers at GCHQ couldn’t find a flaw in Ellis, Cocks, and
Williamson’s unconventional method. Nevertheless, the hierarchy re-
mained unconvinced. Non-secret encryption languished in the office
drawers at GCHQ. Gagged by the Official Secrets Act, the authors
said nothing. They watched on the sidelines as the Stanford and MIT
teams reaped the glory. Ellis never gained the satisfaction of public
recognition. He passed away just a few weeks before the embargo on
his papers was lifted.
Rivest, Shamir, and Adleman won the ACM Turing Award for 2002.
Diffie and Hellman were honoured as the recipients of the 2015 award.
Meanwhile, the Internet grew and grew. By 1985, there were 2,000
hosts (computer sites) on the Internet. Most were owned by academic
institutions. While data transport worked well, the networking pro-
grams of the 1980s were unappealing. Their user interface was text-
heavy, monochrome, and cumbersome to use. To reach a wider audi-
ence, computing needed a makeover.
8
Googling the Web
Amazon Recommends
In the year that Berners-Lee left CERN, a Wall Street investment banker
happened upon a startling statistic. Spurred by the popularity of the
Mosaic browser, web usage had grown 2,300 per cent year-on-year.
The number was ludicrous. Double digit growth is hard to come by,
never mind four-digit. The banker found lots of information on the web
but almost nothing for sale. Surely, this was an untapped market. The
question was: ‘What to sell?’
At the time, the Internet was way too slow to stream music or
videos. Product delivery would have to be by the US postal service. An
online store would be like a mail-order business—only better. Cus-
tomers could view an up-to-date product catalogue and place orders
via the web. The banker looked up a list of the top twenty mail-order
businesses. He concluded that book retail would be a perfect fit. The
investment banker, Jeff Bezos, had stumbled upon the opportunity of a lifetime.
At thirty years old, Bezos was DE Shaw & Company’s youngest ever
Senior Vice President (VP). Born in Albuquerque, New Mexico, Bezos
grew up in Texas and Florida. He attended Princeton University, grad-
uating in Computer Science and Electrical Engineering. After college,
he worked a series of computer and finance jobs, quickly climbing the
ladder to Senior VP.
Bezos’s brainwave left him with a dilemma. Should he pack in his
six-figure New York banking job to go selling books? Bezos called on
an algorithm that he employed for making life altering decisions like
this:
I wanted to project myself forward to age 80 and say, ‘OK, I’m looking
back on my life. I want to minimize the number of regrets I [will] have.’
I knew that when I was 80, I was not going to regret having tried this.
I knew the one thing I might regret is not ever having tried. [That] would
haunt me every day.
Bezos quit his Wall Street job and embarked on a mission to build a
bookstore in a place that didn’t really exist—online.
He needed two things to get his Internet business off the ground: staff
with computer skills and books to sell. Seattle on the northwest coast
of the US had both. The city was home to Microsoft and the country’s
largest book distributors. He and his wife of one year—MacKenzie Bezos
(née Tuttle)—boarded a plane to Texas. On arrival, they borrowed a car
from Bezos’ dad and drove the rest of the way to Seattle. MacKenzie
took the steering wheel while Bezos typed up his business plan on a
laptop computer.
Using his parents’ life savings as seed capital, Bezos set up shop in a
small two-bedroom Seattle house. On 16 July 1995, the amazon.com
website went live.
Figure 8.1 Greg Linden, designer of the first Amazon recommender system,
2000. (Courtesy Greg Linden.)
Let’s say that Charlotte’s Web and The Little Prince have been purchased
together by four different customers; Charlotte’s Web and Pinocchio were bought
by one customer; as were The Little Prince and Pinocchio (Table 8.1). Imagine
that Nicola web surfs to the Amazon site. The recommender algorithm
recovers her purchasing history and finds that she has only bought one
book so far: The Little Prince. The algorithm scans along the The Little Prince
row in the similarity table to find two non-zero entries: 4 and 1. Looking
up the corresponding column headers, the algorithm discovers that
Charlotte’s Web and Pinocchio are paired with The Little Prince. Since Charlotte’s
Web has the higher similarity score (4), it is presented to Nicola as the best
recommendation. In other words, Nicola is more likely to buy Charlotte’s
Web than Pinocchio since, in the past, more customers have bought both
The Little Prince and Charlotte’s Web.
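The table lookup can be sketched directly. The dictionary below encodes the pair-purchase counts from Table 8.1; the function names are invented for the illustration, and Amazon's production system is, of course, far more elaborate.

```python
# Pair-purchase counts from Table 8.1 (the similarity table is symmetric).
similarity = {
    ("Charlotte's Web", "The Little Prince"): 4,
    ("Charlotte's Web", "Pinocchio"): 1,
    ("The Little Prince", "Pinocchio"): 1,
}

def score(book_a, book_b):
    """Look up the pair count in either order; unpaired books score 0."""
    return similarity.get((book_a, book_b)) or similarity.get((book_b, book_a), 0)

def recommend(history, catalogue):
    """Recommend the unowned title most often paired with past purchases."""
    candidates = [b for b in catalogue if b not in history]
    return max(candidates,
               key=lambda b: sum(score(b, owned) for owned in history))

catalogue = ["Charlotte's Web", "The Little Prince", "Pinocchio"]
assert recommend(["The Little Prince"], catalogue) == "Charlotte's Web"
```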
PageRank does more than simply count citations: it takes into con-
sideration the importance of the pages that link to the page being rated.
This stops webmasters artificially raising the rank of a page by creating
spurious pages linking to it. To have any impact, the linking pages must,
themselves, be important. In effect, PageRank ranks webpages based on
the communal intelligence of website developers.
Every webpage is allocated a PageRank score: the higher the score,
the more important the page. The score for a page is equal to the sum
of the weighted PageRanks of the pages that link to it plus a damping term.
The PageRank of an incoming link is weighted in three ways. First, it
is multiplied by the number of links from the linking page to the page
being scored. Second, it is normalized, meaning that it is divided by the
number of links on the linking page. The rationale is that a hyperlink
from a page that contains many links is worth less than a hyperlink from
a page with a small number of links. Third, the PageRanks are multi-
plied by a damping factor. This damping factor is a constant value (typically
0.85) which models the fact that a user might jump to a random page
rather than follow a link. The damping term compensates for this by
adding one minus the damping factor to the total (usually 0.15).
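In symbols, the scheme just described can be written as follows (a reconstruction consistent with the text, writing d for the damping factor, ℓ(q, p) for the number of links from page q to page p, and L(q) for the total number of links on q):

```latex
PR(p) \;=\; (1 - d) \;+\; d \sum_{q \rightarrow p} PR(q)\,\frac{\ell(q,p)}{L(q)},
\qquad d = 0.85
```

The sum runs over every page q that links to p; the (1 − d) term is the damping term of 0.15.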
PageRank can be thought of as the probability that a web surfer who
selects links at random will arrive at a particular page. Lots of links to a
page mean that the random surfer is more likely to arrive there. Many
hyperlinks to those linking pages also means that the surfer is more
likely to arrive at the destination page. Thus, a page’s PageRank depends
not only on the number of links to it but also on the PageRanks of
the linking pages. Therefore, the PageRank of a destination webpage
depends on all of the links that funnel into it. This funnelling effect
goes back one, two, three, and more links in the chain. This dependency
makes PageRank tricky to calculate. If every PageRank depends on every
other PageRank, how do we start the calculation?
The algorithm for computing PageRanks is iterative. Initially, the
PageRanks are set equal to the number of direct incoming links to a
page divided by the average number of inward links. The PageRanks
are then recalculated as the total of the weighted incoming PageRanks
plus the damping term. This gives a new set of PageRanks. These values
are then used to calculate the PageRanks again, and so on. With every
iteration, the calculation brings the PageRanks closer to a stable set of
values. Iteration stops when the PageRanks show no further change.
Table 8.2 Table showing the number of links between pages. The table also
includes the total number of links. A page may not link to itself.
          To A   To B   To C   To D   To E   Total out
From A     -      1      1      0      0      2
From B     0      -      0      0      1      1
From C     0      0      -      1      1      2
From D     1      1      0      -      1      3
From E     1      1      0      0      -      2
Total in   2      3      1      1      3
The algorithm creates a second table listing the PageRanks for every
page (Table 8.3). To start with, the algorithm populates the table with a
rough estimate of the PageRanks. This is the number of incoming links
for a page divided by the average number of inward links.
The algorithm then recalculates the PageRanks page by page, that
is, column by column in the PageRank table. A page is processed by
calculating the incoming weighted PageRanks for a single column in
the links table. The PageRank of each incoming link is obtained by
looking up the current PageRank for that page (Table 8.3). This value is
multiplied by the number of incoming links from the page. The result
is divided by the total number of outgoing links on the page (Table 8.2).
This value is multiplied by the damping factor (0.85) to obtain the
weighted PageRank. This calculation is performed for all incoming
links. The resulting weighted incoming PageRanks are totalled and
added to the damping term (0.15). This gives the new PageRank for the
page. This value is appended to the PageRank table.
The PageRank calculation is repeated for all of the pages. When all
of the pages have been processed, the new PageRanks are compared to
the previous values. If the change is small, the process has converged
and the results are output. Otherwise, the calculation is repeated (the
complete algorithm is listed in detail in the Appendix).
For example, imagine performing the second iteration in calculating
the PageRank of page A. Its PageRank is the sum of the weighted
PageRanks of the pages with incoming links, that is, pages D and E. The
initial PageRank of D is 0.5. The number of links from D to A is 1. The
number of links from D is 3. Therefore, the weighted PageRank from D
to A is 0.85 × 0.5 × 1 ÷ 3 = 0.142. Similarly, the weighted PageRank from E to
A is 0.85 × 1.5 × 1 ÷ 2 = 0.638. Including the damping term, the new PageRank
for A is 0.142 + 0.638 + 0.15 = 0.93.
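The whole procedure can be reproduced in a short sketch (a minimal Python rendering of the iteration described in the text, using the link counts of Table 8.2 and a damping factor of 0.85; names are ours):

```python
# A sketch of the iterative PageRank calculation for the five-page web
# of Table 8.2.
LINKS = {            # LINKS[q][p] = number of links from page q to page p
    "A": {"B": 1, "C": 1},
    "B": {"E": 1},
    "C": {"D": 1, "E": 1},
    "D": {"A": 1, "B": 1, "E": 1},
    "E": {"A": 1, "B": 1},
}
DAMPING = 0.85

def iterate(ranks):
    """One update: weighted incoming PageRanks plus the damping term."""
    new = {}
    for p in ranks:
        incoming = 0.0
        for q, out in LINKS.items():
            if p in out:  # q links to p: weight by links and normalize
                incoming += ranks[q] * out[p] / sum(out.values())
        new[p] = (1 - DAMPING) + DAMPING * incoming
    return new

# Initial estimate: incoming links divided by the average number (2).
ranks = {"A": 1.0, "B": 1.5, "C": 0.5, "D": 0.5, "E": 1.5}
for _ in range(4):                     # iterations 2 through 5
    ranks = iterate(ranks)
print({p: round(r, 2) for p, r in ranks.items()})
# iteration 5 matches Table 8.3: A 0.97, B 1.38, C 0.56, D 0.40, E 1.69
```

Running a few more iterations changes the values only slightly, which is the convergence condition for stopping.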
The iterative procedure balances the PageRanks so that they reflect
the inter-connection of the webpages. The larger values flow to the
Table 8.3 PageRank estimates by iteration.
Iteration    A      B      C      D      E
1           1.00   1.50   0.50   0.50   1.50
5           0.97   1.38   0.56   0.40   1.69
it. In spirit, he felt he was paying back an old favour granted by one of his
early backers. Clutching the cheque, Brin and Page neglected to point
out that Google, Inc., didn’t exist—yet.
The following February, PC Magazine reported that the fledgling
Google search engine: 153
has an uncanny knack for returning extremely relevant results.
One year later, Google, Inc., received $25 million in funding from
venture capitalists Sequoia Capital and Kleiner Perkins. Sequoia Capital
and Kleiner Perkins were, and still are, Silicon Valley royalty. Their
names on the investor sheet were almost worth more than the money.
The following year, Google launched AdWords. AdWords allows
advertisers to bid for their links to be listed on the Google search page
alongside the PageRank results. The promoted links are clearly delin-
eated from the PageRank results so that users can distinguish between
PageRank results and advertisements. AdWords proved to be much
more effective than traditional advertising. This was hardly surprising,
as AdWords was offering products that customers were already actively
searching for. Companies flocked to the new advertising platform.
AdWords turned free web search into a goldmine.
The PageRank algorithm was patented by Stanford University.
Google exclusively licensed the algorithm back from Stanford for use
in its search engine in exchange for 1.8 million shares. In 2005, Stanford
University sold its Google shares for $336 million. That transaction
probably makes PageRank the most valuable algorithm in history.
9
Facebook and Friends
With all these data you should be able to draw some just inference.
Sherlock Holmes to Dr Watson
Sir Arthur Conan Doyle
The Sign of Four, 1890 160
Mark Zuckerberg (Figure 9.1) was born in White Plains, New York, in
1984. His father taught him how to program when he was in middle
school. Later, his dad hired a professional programmer to tutor him.
A whiz in high school, Zuckerberg was always destined for the Ivy
League. He enrolled in Harvard University and selected joint honours
in Computer Science and Psychology.
As well as a love of coding, Zuckerberg had an abiding interest in
the behaviour of people. He realized early on that most people are
fascinated by what other people are doing. This obsession lies at the
very heart of everyday gossip, weighty biographies, celebrity culture,
and reality TV. At Harvard, he began to experiment with software
that would facilitate the basic human need to connect and interact
with others.
Zuckerberg set up a website called Facemash. 162 Facemash was mod-
elled on an existing web site by the name of Hot Or Not. 161 Both sites
displayed side-by-side images of two male, or two female, students
and asked the user to select the more attractive of the two. Facemash
collated the votes and displayed ranked lists of the ‘hottest’ students.
Controversially, Facemash used photographs of students downloaded
from Harvard web sites. The site was popular with a certain cohort of
students but upset a lot of others. According to the on-campus newslet-
ter, the site landed Zuckerberg in front of a disciplinary board. 163
Afterwards, Zuckerberg turned to constructing a new website for
social networking. A few such sites already existed, allowing users to share
information about themselves. Most previous sites were aimed at people
Figure 9.1 Facebook co-founder Mark Zuckerberg, 2012. (By JD Lasica from
Pleasanton, CA, US - Mark Zuckerberg, / Wikimedia Commons / CC BY 2.0, https://commons.
wikimedia.org/w/index.php?curid=72122211. Changed from colour to monochrome.)
In the beginning, the only way to find a fresh post was to check users’
profile pages for updates. Mostly, checking pages was a waste of time—
there was just nothing new to see. Zuckerberg realized that it would
be helpful for users to have a page summarizing the latest posts from
their pals.
Over the next eight months, the Facebook News Feed algorithm was
born. The algorithm proved to be the biggest engineering challenge
that the young company had faced. News Feed wasn’t just a new
feature. It was a re-invention of Facebook.
The idea was that News Feed would produce a unique news page for
every single user. The page would list the posts most relevant to that
particular user. Everyone’s feed would be different – personalized for
them by the system.
Facebook activated News Feed on Tuesday, 5 September 2006. User
reaction was almost unanimous. Everyone hated it. People felt it was
too stalker-esque. In fact, nothing was visible that had not been available
on Facebook previously. However, News Feed made it easier to see what
was going on in everyone else’s lives. It seemed that Zuckerberg had
misjudged users’ emotional reaction to a perceived change in their data
privacy.
Anti-News Feed groups sprang up on Facebook and flourished. Iron-
ically, students were using the very feature that they were objecting to,
to help them protest. For Zuckerberg, this was proof positive that News
Feed had worked. The numbers backed him up. Users were spending
more time on Facebook than ever before. Facebook made its apologies,
added privacy controls, and waited for the fuss to die down.
The engineering challenge at the core of News Feed lay in creating
an algorithm that would select the best news items to display to a user.
The question was: how could a computer algorithm possibly determine
what a human user was most interested in? The details of Facebook’s
News Feed algorithm remain a closely guarded secret. However, some
information was disclosed in 2008.
The original News Feed algorithm was called EdgeRank. The name
seems to have been a nod to Google’s PageRank. Every action on Face-
book is called an edge, be it a user post, a status update, a comment, a
like, a group join, or an item share. An EdgeRank score is calculated for
every user and every edge by multiplying three factors:
EdgeRank = affinity × weight × time decay.
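As an illustration, a news feed could rank candidate items by such a product of factors. The sketch below is ours: the factor values are invented, and Facebook’s real affinity, weight, and decay functions were never published.

```python
# An illustrative EdgeRank-style ranking. Each edge carries an affinity
# (closeness to the posting user), a weight (edge type), and a time
# decay (freshness); the score is their product.
edges = [
    # (description, affinity, weight, time_decay)
    ("close friend's photo, 1 hour old",  0.9, 1.0, 0.9),
    ("acquaintance's like, 2 days old",   0.2, 0.3, 0.2),
    ("friend's status update, 1 day old", 0.7, 0.6, 0.5),
]

def edgerank(affinity, weight, time_decay):
    return affinity * weight * time_decay

# Rank the candidate news-feed items, highest score first.
ranked = sorted(edges, key=lambda e: edgerank(*e[1:]), reverse=True)
for desc, *factors in ranked:
    print(f"{edgerank(*factors):.3f}  {desc}")
```

A fresh photo from a close friend (0.81) outranks an older status update (0.21), which in turn outranks a stale like from an acquaintance (0.012).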
star ratings. The qualifying dataset was much smaller, containing just
2.8 million entries.
The goal of the competition was to build a recommender that could
accurately predict the redacted movie ratings in the qualifying dataset.
Netflix would compare the estimates provided by a competitor with the
hidden user-assigned ratings. The competitor’s estimates were evaluated
by measuring the prediction error—the average difference between the
predicted and actual ratings squared.
The $1 million prize attracted hobbyists and serious academic re-
searchers alike. As far as the academics were concerned, the dataset was
gold dust. It was very difficult to get hold of real-world datasets of this
size. At the outset, most thought that a ten per cent improvement in
accuracy was going to be trivial. They underestimated the effectiveness
of Cinematch.
Ratings predictions can be made in a large number of ways. The most
effective technique used in the competition was to combine as many
different predictions as possible. In predictor parlance, any information
which can be used as an aid to prediction is a factor that must be taken
into account in the final reckoning.
The simplest factor is the average rating for the movie in the training
dataset. This is the average across all users who have watched that
particular movie.
Another factor that can be considered is the generosity of the user
whose rating is being predicted. A user’s generosity can be calculated
as their average rating minus the average across all users for the same
movies. The resulting generosity modifier can be added to the average
rating for the movie being predicted.
Another factor is the ratings given to the movie by users who gener-
ally score in the same way as the user in question. The training dataset
is searched for these users. The prediction is then the average of their
ratings for the movie.
Yet another factor is the ratings given by the user to similar movies.
Again, the training dataset is inspected. This time, movies typically
allocated similar ratings to the show in question are identified. The
average of the user’s ratings for these films is calculated.
These factors, and any others available, are combined by weighting
and addition. Each factor is multiplied by a numeric value, or weight.
These weights control the relative importance of the factors. A high
weight means that the associated factor is more important in deter-
mining the final prediction. A low weight means that the factor is of
less importance. The weighted factors are summed to give the final
prediction.
In summary, the prediction algorithm is then:
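As a Python sketch (a minimal rendering of the weighted sum just described, not any team’s actual listing; the factor values and weights here match the worked example that follows):

```python
# Combine several factor estimates into one rating by weighted sum.
def predict(factors, weights):
    return sum(f * w for f, w in zip(factors, weights))

factors = [3.7,   # average rating for the movie
           4.6,   # movie average plus the user's generosity bonus
           5.0,   # average rating by users similar to this one
           4.5]   # user's average rating for similar movies
weights = [0.1, 0.1, 0.4, 0.4]
print(round(predict(factors, weights), 1))  # 4.6
```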
Imagine that the algorithm is trying to predict Jill’s rating for Toy
Story (Table 9.1). The first factor is simply the average rating for Toy Story
in the training dataset. This gives an estimate of 3.7 stars. The second
factor—Jill’s generosity—is obtained by calculating Jill’s average rating
and subtracting the average rating for the same movies in the training
dataset. Jill’s average rating is 4 while the average for the same movies is
3.1. Therefore, Jill’s generosity bonus is +0.9. Adding this to the global
average gives 4.6 stars. Next, the dataset is searched for users whose
previous ratings are similar to Jill’s. These are clearly Ian and Lucy. They
gave Toy Story 5 stars, so that’s another factor. The fourth factor requires
that movies that Jill has watched which normally attain similar ratings
to Toy Story are found. The obvious ones are Finding Nemo and The Incredibles.
Jill gave these two movies an average rating of 4.5 stars. So, that is
another factor. In conclusion, we have four estimates of Jill’s rating of
Toy Story: 3.7, 4.6, 5, and 4.5 stars.
If the last two factors are typically the most reliable, we might well
use the weights: 0.1, 0.1, 0.4, 0.4. Multiplying and summing gives a final
prediction of 4.6 stars. Jill should definitely watch Toy Story!
Although this general approach was common, teams varied in the
specific features they exploited. Sixty features or more were not un-
usual. Teams also experimented with a wide range of similarity met-
rics and ways to combine predictions. In most systems, the details of
the predictions were controlled by numerical values. These numerical
values, or parameters, were then tuned to improve the prediction results.
For example, the weight parameters were tweaked to adjust the relative
importance of factors. Once the factors are identified, accurate rating
prediction depends on finding the best parameter values.
To determine the optimum parameter values, teams turned to ma-
chine learning (see Chapter 5). To begin with, a team sets aside a subset
of the training dataset for validation. Next, the team guesses the param-
eter values. A prediction algorithm is run to obtain predictions for
the ratings in the validation dataset. The prediction error is measured
for these estimates. The parameters are then adjusted slightly in the
hope of reducing the prediction error. These prediction, evaluation, and
parameter adjustment steps are repeated many times. The relationship
between the parameter values and the prediction error is monitored.
Based on the relationship, the parameters are tuned so as to minimize
the error. When no further reduction in error can be obtained, training
is terminated and the parameter values frozen. These final parameter
values are used to predict the missing ratings in the qualifying dataset
and the results submitted to Netflix for adjudication.
Overall, the training algorithm works as follows:
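As a toy Python sketch (the validation data is invented and the optimizer is a simple accept-if-better nudge; competition teams used far more sophisticated methods):

```python
import random

# Guess the weights, measure validation error, nudge each weight, and
# keep any change that reduces the error; stop after a fixed budget.
validation = [            # (factor estimates, actual rating)
    ([3.7, 4.6, 5.0, 4.5], 5.0),
    ([3.1, 3.0, 2.0, 2.5], 2.0),
    ([4.0, 4.2, 4.0, 3.5], 4.0),
]

def error(weights):
    """Mean squared prediction error over the validation set."""
    return sum((sum(f * w for f, w in zip(fs, weights)) - r) ** 2
               for fs, r in validation) / len(validation)

random.seed(0)
weights = [random.random() for _ in range(4)]    # initial guess
initial_err = error(weights)

STEP = 0.01
for _ in range(500):              # prediction / evaluation / adjustment
    for i in range(len(weights)):
        for delta in (STEP, -STEP):
            trial = weights.copy()
            trial[i] += delta
            if error(trial) < error(weights):
                weights = trial   # keep the improvement

final_err = error(weights)
print(round(initial_err, 3), round(final_err, 3))
```

Since only error-reducing changes are kept, the final validation error can never be worse than the initial one; in a real system the frozen weights would then be applied to the qualifying dataset.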
pretty good payday. Pity the second-placed team who, after three years,
lost by just ten minutes.
In a plot twist, Netflix decided not to deploy the winning algorithm.
The company had already replaced Cinematch with the winner of an
earlier stage. Netflix reckoned that an 8.43% improvement was good
enough and left it at that.
The competition was a resounding success. In total, 51,051 contes-
tants from 186 countries organized into 41,305 teams had entered.
While the competition was underway, Netflix announced a shift
from disk rental to online streaming of movies across the Internet.
The company soon discontinued its disk rental service entirely. Today,
Netflix is the world’s largest Internet television network, with more
than 137 million subscribers. Marc Randolph left Netflix in 2002. He is
now on the board of several tech companies. Reed Hastings continues
as Netflix CEO. He is on Facebook’s Board of Directors as well. Hastings
is currently at number 504 on the Forbes Billionaires List (2019).
McKinsey recently reported that a staggering seventy-five per cent of
Netflix’s views are based on system recommendations. Netflix estimates
that personalization and movie recommendation services save the com-
pany $1 billion every year through enhanced customer retention.
In 2009, it was starting to seem like big data, coupled with machine
learning, could predict almost anything.
peak of the CDC data by more than fifty per cent—a huge error. Soon,
more Google Flu Trends errors were reported. One research group even
demonstrated that Google Flu Trends’ predictions were less accurate
than basing today’s flu predictions on two-week-old CDC data. Amidst
the clamour, Google Flu Trends was discontinued.
What went wrong?
With hindsight, the original work used too little CDC data for train-
ing. There was so little CDC data and so many queries that some queries
just had to match the data. Many of these matches were random. The
statistics happened to match but the queries were not a consequence of
anyone actually having the flu. In other words, the queries and flu data
were correlated but there was no causal relationship. Second, there was
too little variability in the epidemics captured in the training data. The
epidemics happened at much the same time of year and spread in similar
ways. The algorithm had learned nothing about atypical outbreaks.
Any deviation from the norm and it floundered. Third, media hype
and public concern surrounding the deaths in 2013 probably led to
a disproportionate number of flu queries which, in turn, caused the
algorithm to overshoot.
The bottom line is that machine learning algorithms are only as good
as the training data supplied to them.
The science of nowcasting has moved on significantly since Google
Flu Trends. Real-world conditions are now determined at scale and
low cost by means of networked data-gathering devices and analyt-
ics algorithms. Sentiment analysis applied to Twitter posts is used to
predict box office figures and election outcomes. Toll booth takings are
used to estimate economic activity in real-time. The motion sensors
in smartphones have been monitored to detect earthquakes. Undoubt-
edly, nowcasting of disease epidemics will be revisited in the future with
the aid of more reliable health sensors.
Meanwhile, in 2005, one year before the Netflix Prize was launched,
a group of IBM executives were on the hunt for a fresh computing
spectacular. The 1997 defeat of world Chess champion Garry Kasparov
by IBM’s Deep Blue computer had hit headlines around the world. That
victory was more about computer chip design than novel algorithms.
Nevertheless, the event was a milestone in the annals of computing.
IBM wanted a sequel for the new millennium. The challenge had to be
well chosen—a seemingly impossible feat that would grab the general
public’s attention. Something truly amazing …
10
America’s Favourite Quiz Show
$2.5 million in prize money. Jennings, now 36, had been a computer
programmer before his Jeopardy success.
Brad Rutter had amassed the greatest winnings in Jeopardy history—
$3.25 million. Rutter was four years Jennings’ junior and worked in
a record store before his first appearance on the show.
The seeds of the IBM Jeopardy Challenge were sown six years previ-
ously. At the time, IBM management were on the lookout for a spec-
tacular computing event. They wanted something that would capture
the public imagination and demonstrate the capabilities of IBM’s latest
machines.
With this in the back of his mind, Director of IBM Research Paul Horn
happened to be on a team night out at a local restaurant. Mid-meal,
the other diners left their tables en masse and congregated in the bar. He
turned to his colleagues and asked, ‘What’s going on?’. Horn was told
that everyone else was watching Jeopardy on TV. Ken Jennings was on his
record-breaking winning streak. Half the country wanted to see if he
could keep it going. Horn paused and wondered if a computer could
play Jeopardy.
Back at base, Horn pitched the idea to his team. They didn’t like
it. Most of the researchers reckoned that a computer wouldn’t stand
a chance at Jeopardy. The topics were too wide ranging. The questions
were too cryptic. There were puns, jokes, and double meanings—all
stuff that computers weren’t good at processing. Regardless, a handful
of staff decided to give it a go.
One of the volunteers, Dave Ferrucci, later became Principal Inves-
tigator on the project. A graduate of Rensselaer Polytechnic Institute
in New York, Ferrucci had joined IBM Research straight out of college
after gaining a PhD in Computer Science. His specialism was knowledge
representation and reasoning. He was going to need that expertise—
this was the toughest natural language processing and reasoning chal-
lenge around.
IBM’s first prototype was based on the team’s latest research. The
machine played about as well as a five-year-old child. The Challenge
wasn’t going to be easy. Twenty-five IBM Research scientists were to
spend the next four years building Watson.
By 2009, IBM were confident enough to call the producers of Jeopardy
and suggest that they put Watson to the test. The show executives
arranged a trial against two human players. Watson didn’t perform
at all well. Its responses were erratic—some were correct, others
the focus is ‘he’—an individual male; the answer type is ‘clerk’ and
‘writer’ (implicit); plus the question classification is ‘factoid’—a short
factual piece of information.
Once clue analysis is complete, Watson searches for answers in its
database. Watson launches a number of searches. These searches access
the structured and unstructured data held in Watson’s memory banks. Struc-
tured data is the name for information held in well-organized tables.
Structured tabular data is great for factoid lookup. For example, Watson
can look up Songs of a Sourdough in a table containing the titles and writers
of well-known songs. However, given the obscurity of the poem, this
hunt is likely to be fruitless.
Unstructured data is the term for information which is not formally
organized. Unstructured data includes information held in textual doc-
uments, such as newspapers or books. Plenty of knowledge is contained
therein but it is difficult for a computer to interpret. Retrieving useful
information from unstructured data turned out to be one of the biggest
problems in building Watson. In the end, the team found some surpris-
ingly effective tricks.
One technique involves searching for an encyclopaedia article men-
tioning all of the words in the clue. Often, the title of the article is the
sought-after answer. For example, searching Wikipedia for the words
‘bank clerk Yukon Songs of a Sourdough 1907’ returns an article entitled
‘Robert W. Service’. This is the correct answer.
Another option is to search for a Wikipedia entry whose title is the
focus of the clue. The algorithm then hunts for the desired information
in the body of the selected article. For example, Watson handles the
clue ‘Aleksander Kwasniewski became the president of this country
in 1995’ by looking up an article entitled ‘Aleksander Kwasniewski’
in Wikipedia. The computer scans the article for the most frequently
occurring country name.
Watson launches a battery of such searches in the hope that one will
yield accurate results.
The resulting candidate answers are assessed by calculating how well
the answers meet the requirements of the clue. Every aspect of the
answers and the clue are compared and scored. The answer with the
highest score is selected as the best solution. The score is compared with
a fixed threshold. If the score exceeds the threshold, Watson rephrases
the solution as a question and presses the buzzer. If called on, Watson
offers the question to the quizmaster.
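The final selection step can be sketched schematically (the candidates, scores, and threshold below are invented for illustration; Watson’s real scoring combined hundreds of evidence features):

```python
# Score-and-threshold answer selection: take the best-scoring candidate,
# and only buzz in if its score clears a confidence threshold.
THRESHOLD = 0.5

def best_answer(candidates):
    """candidates: {answer: evidence score in [0, 1]}."""
    answer, score = max(candidates.items(), key=lambda kv: kv[1])
    if score < THRESHOLD:
        return None                    # not confident enough to buzz
    return f"Who is {answer}?"         # rephrase as a Jeopardy question

print(best_answer({"Robert W. Service": 0.92, "Jack London": 0.31}))
# Who is Robert W. Service?
```

If no candidate clears the threshold, the machine stays silent rather than risk a wrong answer.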
Watson’s roots lie in the expert systems and case-based reasoning technolo-
gies of the 1970s and 1980s.
Expert systems use hand written if-then-else rules to transform tex-
tual inputs into outputs. The first popular expert system, MYCIN, was
developed by Edward Feigenbaum’s team at Stanford University. It was
designed to assist physicians in determining whether an infection is
bacterial or viral. Bacterial infections can be treated with antibiotics,
whereas viral infections are unresponsive to medication. Doctors com-
monly over-prescribe antibiotics, mistakenly recommending them for
viral infections. MYCIN assists the prescribing physician by asking a
series of questions. These probe the patient’s symptoms and the out-
comes of diagnostic tests. The sequence of questions is determined by
lists of hand-crafted rules embedded in the MYCIN software. MYCIN’s
final diagnosis—bacterial or viral—is based on a set of rules defined by
medical experts.
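The flavour of such hand-written rules can be conveyed in a few lines (a toy illustration in the style of an expert system; these rules are invented, not MYCIN’s actual rule base, and are no basis for medical advice):

```python
# Hand-crafted if-then rules applied to answers gathered by questioning.
def diagnose(answers):
    """answers: dict of symptom/test name -> bool."""
    if answers.get("positive_gram_stain"):
        return "bacterial"
    if answers.get("high_white_cell_count") and answers.get("fever"):
        return "bacterial"
    if answers.get("fever"):
        return "viral"
    return "inconclusive"

print(diagnose({"fever": True, "high_white_cell_count": True}))  # bacterial
print(diagnose({"fever": True}))                                 # viral
```

Every path through the decision is spelled out by a human expert in advance; nothing is learned from data.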
Case-based reasoning (CBR) systems allow for more flexible decision-
making than expert systems. The first working CBR system is widely
regarded to have been CYRUS, developed by Janet Kolodner at Yale
University. CYRUS is a natural language information retrieval system.
The system holds the biographies and diaries of US Secretaries of State
Cyrus Vance and Edmund Muskie. By referral to these information
sources, CYRUS enters into a dialog with the user, answering questions
about the two subjects. For example: 190
Question: Who is Cyrus Vance?
Answer: Secretary of State of the United States.
Q: Does he have any kids?
A: Yes, Five
Q: Where is he now?
A: In Israel
CYRUS generates candidate answers by matching the query with
passages from the documents. All matches found are scored according
to similarity. The candidate answer with the greatest score is phrased
appropriately and returned to the user.
The main drawback with expert systems is that every rule and
consideration must be manually programmed into the system. Case-based
reasoning, for its part, requires that every potential linguistic
nuance in the questions and source material be dealt with by the program.
Due to the complexity of natural language, the mapping between a
question and valid answer is complex and convoluted. Algorithms must
11
Mimicking the Brain
Brain Cells
By the onset of the twentieth century, neuroscientists had established
a basic understanding of the human nervous system. The work was
spearheaded by Spanish neuroscientist Santiago Ramón y Cajal, who
received the Nobel Prize for Medicine in 1906.
The human brain is composed of around 100 billion cells, or neurons.
A single neuron is made up of three structures: a central body, a set of
input fibres called dendrites, and a number of output fibres called axons.
When inspected under a microscope, the long thin wispy dendrites and
axons stretch away from the bulbous central body, branching as they
go. Every axon (output) is connected to the dendrite (input) of another
neuron via a tiny gap, called a synapse. Neurons in the brain are massively
interconnected. A single neuron can be connected to as many as 1,500
other neurons.
The brain operates by means of electrochemical pulses sent from one
neuron to another. When a neuron fires, it sends a pulse from its central
body to all of its axons. This pulse is transferred to the dendrites of
the connected neurons. The pulse serves to either excite or inhibit the
receiving neurons. Pulses received on certain dendrites cause excitation,
and pulses received on others cause inhibition. If a cell receives sufficient
excitation from one or more other neurons, it will fire. The firing of one
cell can lead to a cascade of firing neurons. Conversely, pulses arriving
at inhibitory neural inputs reduce the level of excitation, making it less
likely that a neuron will fire. The level of excitation, or inhibition, is
influenced by the frequency of the incoming pulses and the sensitivity
of the receiving dendrite.
Canadian neuropsychologist Donald Hebb discovered that when a
neuron persistently fires, a change takes place in the receiving dendrites.
The weights and biases of the neurons are collectively called the
parameters of the network. Their values determine the conditions under
which the neurons fire, that is, output a 1. Parameters can be positive or
negative numbers. An input of 1 applied to a positive weight increases
the excitation of the neuron, making it more likely to fire. Conversely,
an input of one multiplied by a negative weight decreases the neu-
ron’s excitation, making it less likely to fire. The bias determines how
much input excitation is needed before the neuron exceeds the fixed
threshold and fires. During normal network operation, the parameters
are fixed.
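A single artificial neuron of this kind can be sketched in a few lines (a minimal rendering of the description above, with the firing threshold folded into the bias so the neuron fires when the total exceeds zero; names are ours):

```python
# One artificial neuron: weighted inputs plus a bias, thresholded.
def neuron(inputs, weights, bias):
    """Fire (output 1) if the excitation exceeds the threshold."""
    excitation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if excitation > 0 else 0

# A positive weight excites; a negative weight inhibits.
print(neuron([1, 1], weights=[0.8, -0.5], bias=-0.2))  # 1 (0.8-0.5-0.2 > 0)
print(neuron([0, 1], weights=[0.8, -0.5], bias=-0.2))  # 0 (inhibited)
```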
Perceptrons take a number of inputs (Figure 11.4). For example,
Rosenblatt’s 20×20 pixel image is input to the Perceptron on 400
connections. Each connection has a 0 or 1 value depending on whether the
associated pixel is black or white. The input connections are fed into
the first layer of the network: the input layer. Neurons in the first layer
only have a single input connection. Thereafter, the outputs from one
layer feed into the inputs of the next layer. In a fully connected network
all outputs from one layer are input to every neuron in the following
layer. The output of the input layer is connected to the first hidden layer.
Hidden layers are those not directly connected to either the network
inputs or outputs. In a simple network, there might be only one hidden
layer. After the hidden layers comes the output layer. The outputs from
the neurons in this layer are the final network outputs. Each output
neuron corresponds to a class. Ideally, only the neuron associated with
the recognized class should fire (i.e. output a 1).
Figure 11.4 Perceptron with three inputs, two fully connected layers, and two
output classes.
the output connection with the largest value. The change has the side
effect of improving robustness when the input is on the cusp of two
classes.
Normal operation of an ANN is called forward-propagation (or inference).
The ANN accepts an input, processes the values neuron-by-neuron,
layer-by-layer, and produces an output. During forward-propagation,
the parameters are fixed.
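Forward-propagation through a tiny fully connected network can be sketched as follows (two inputs, a hidden layer of two neurons, and two output-class neurons; the weights and biases are illustrative, not trained values):

```python
# Forward-propagation: process the input neuron-by-neuron, layer-by-layer.
def neuron(inputs, weights, bias):
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s > 0 else 0

def layer(inputs, neurons):
    """neurons: list of (weights, bias) pairs, all fed the same inputs."""
    return [neuron(inputs, w, b) for w, b in neurons]

hidden = [([0.6, 0.6], -0.5), ([-0.6, -0.6], 0.5)]   # first hidden layer
output = [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0)]    # one neuron per class

x = [1, 0]
h = layer(x, hidden)   # hidden-layer activations
y = layer(h, output)   # final network outputs
print(h, y)            # [1, 0] [1, 0] -> the first class fires
```

All parameters stay fixed during this pass; only training changes them.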
Backprop is only used in training, and operates within a machine
learning framework (see Chapter 9). To begin, a dataset containing a
large number of example inputs and associated outputs is assembled.
This dataset is split into three. A large training set is used to determine
the best parameter values. A smaller validation set is put aside to assess
performance and guide training. A test set is dedicated to measuring the
accuracy of the network after training is finished.
Training begins with random network parameters. It proceeds by
feeding an input from the training set into the network. The network
processes this input (forward-propagation) using the current parameter
values to produce an output. This network output is compared with
the desired output for that particular input. The error between the
actual and desired output is measured as the average of the squared
differences between the actual and desired outputs.
Imagine that the network has two classification outputs: circle and
triangle. If the input image contains a circle then the circle output
should be 1, and the triangle output 0. Early in the training process,
the network probably won’t work at all well, since the parameters are
random. So, the circle output might have a value of 2/3 and the triangle 1/3.
The error is equal to the average of (1 − 2/3) squared and (0 − 1/3) squared,
that is 1/9.
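The worked example above can be checked directly, using the circle and triangle outputs as given:

```python
# Desired outputs: circle = 1, triangle = 0.
desired = [1.0, 0.0]
# Actual network outputs early in training (parameters still random).
actual = [2 / 3, 1 / 3]

# Error: the average of the squared differences between
# the actual and desired outputs.
error = sum((d - a) ** 2 for d, a in zip(desired, actual)) / len(desired)
print(error)  # 1/9, roughly 0.111
```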
The parameters in the network are then updated based on the error.
The procedure commences with the first weight in the first neuron of
the output layer. The mathematical relationship between this partic-
ular weight and the error is determined, and this relationship is used
to calculate how much the weight should change to reduce the error
to zero. This result is reduced by a constant value, called the learning
rate, and subtracted from the current weight. The weight adjustment
has the effect of reducing the error if the network is presented with the
same input again. Multiplication by the learning rate ensures that the
Figure 11.5 Artificial neural network innovator Yann LeCun, 2016. (Courtesy
Facebook.)
Recognizing Digits
LeCun was born in Paris in 1960. He received the Diplôme d’Ingénieur
degree from the École Supérieure d’Ingénieurs en Electrotechnique et
Electronique (ESIEE) in 1983. In his second year, he happened upon a
philosophy book discussing the nature versus nurture debate in child-
hood language development. Seymour Papert was one of the contrib-
utors. From there, he found out about Perceptrons. He started reading
everything he could find on the topic. Pretty soon, LeCun was hooked.
He specialized in neural networks for his PhD at Université Pierre et
Marie Curie (1987). After graduation, LeCun worked for a year as a
post-doctoral researcher in Geoffrey Hinton’s lab at the University of
Toronto in Canada. By then, Hinton was an established figure in the
neural network community, having been a co-author on the backprop
letter published in Nature. A year later, LeCun moved to AT&T Bell
Laboratories in New Jersey to work on neural networks for image
processing.
LeCun joined a team that had been working on building a neural
network to recognize handwritten digits. Conventional algorithms
weren’t at all good at this—there was too much variability in the
writing styles. Casually written 7s are easily confused with 1s, and
vice versa. Incomplete 0s can be interpreted as 6s, and 2s with long
tails are often mixed up with truncated 3s. Rule-based algorithms just
couldn’t cope.
The team acquired a large dataset of digital images by scanning
the zip codes from the addresses on envelopes passing through the
Buffalo, New York, post office. Every letter yielded five digits. In the end,
the dataset incorporated 9,298 images. These examples were manually
sorted into ten classes corresponding to the ten decimal digits (0 to 9).
The team developed an ANN to perform the recognition task but
had little success with it. The complex mapping necessitated a large net-
work. Even with backprop, training the network had proven difficult.
To solve the problem LeCun suggested an idea that he had tinkered with
in Hinton’s lab.
The ANN took a 16x16 pixel greyscale image of a single digit as input.
The network output consisted of ten connections—one for each digit
class between 0 and 9. The output with the strongest signal indicated
the digit recognized.
LeCun’s idea was to simplify the network by breaking it up into
lots of small networks with shared parameters. His approach was to
create a unit containing just twenty-five neurons and a small number
of layers. The input to the unit is a small portion of the image—a
square of 5x5 pixels (Figure 11.6). The unit is replicated sixty-four times
to create a group. The units in the group are tiled across the image, so
Figure 11.6 Grayscale image of the digit 7 (16×16 pixels). The pixel inputs to
the unit at the top left are highlighted. Sixty-four copies of this unit are spread
over the image.
that it is entirely covered in units. Every unit’s input overlaps with its
neighbour’s by three pixels.
The overall network contains twelve groups. Since the units in a
group share the same parameters, all perform the same function but
are applied to a different part of the image. Each group is trained to
detect a different feature. One group might detect horizontal lines in
the picture, another vertical, still another diagonal. The outputs from
each group are fed into a fully connected three-layer network. These
final layers fuse the information coming from the groups and allow
recognition of the digit in its entirety.
The network is hierarchical in structure. A single unit detects a 5×5
motif in the image. A group spots a single motif anywhere in the image.
The twelve groups detect twelve different motifs across the image. The
final, fully connected layers detect the spatial relationships between the
twelve motifs. This hierarchical organization draws inspiration from the
human visual cortex, wherein units are replicated and successive layers
process larger portions of the image.
The beauty of LeCun’s scheme is that all of the units in a single group
share the same weights. As a consequence, training is greatly simplified.
Training the first layers in the network only involves updating twelve
units, each containing just twenty-five neurons.
The mathematical process of replicating and shifting a single unit
of computation across an image is called convolution. Hence, this type of
network became known as a convolutional neural network.
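The replicate-and-shift operation can be sketched with a single shared 5×5 kernel slid across a 16×16 image. The stride and the uniform kernel here are illustrative choices, not the exact configuration of the Bell Labs network:

```python
import numpy as np

def convolve(image, kernel, stride):
    # Slide one shared-weight unit across the image. Every position
    # applies the *same* kernel, so the whole feature map is produced
    # by a single small set of parameters.
    k = kernel.shape[0]
    out = []
    for i in range(0, image.shape[0] - k + 1, stride):
        row = []
        for j in range(0, image.shape[1] - k + 1, stride):
            patch = image[i:i + k, j:j + k]
            row.append(np.sum(patch * kernel))
        out.append(row)
    return np.array(out)

image = np.random.default_rng(1).random((16, 16))  # greyscale digit image
kernel = np.ones((5, 5)) / 25                      # shared weights (one "unit")
feature_map = convolve(image, kernel, stride=2)
print(feature_map.shape)  # (6, 6): one output per unit position
```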
The Bell Labs convolutional neural network proved to be extremely
effective, achieving a breathtaking accuracy of ninety-five per cent.
The network was close to human accuracy. The team’s findings were
published in 1989 and the system commercialized by AT&T Bell Labs.
It was estimated that, in the late 1990s, ten to twenty per cent of bank
cheques signed in the US were automatically read by a convolutional
neural network.
In 2003, Yann LeCun left Bell Labs to be appointed Professor of Com-
puter Science at New York University. Meanwhile, Geoffrey Hinton, his
old mentor at the University of Toronto, was putting together a posse.
Deep Learning
Hinton (Figure 11.7) was born in post-war Wimbledon, England (1947).
Hinton reckons that he wasn’t particularly good at maths in school.
Nevertheless, he gained entry to the University of Cambridge, enrolling
in Physics and Physiology. Dissatisfied, he transferred to Philosophy.
Figure 11.7 Deep neural network pioneer Geoffrey Hinton, 2011. (Courtesy
Geoffery Hinton.)
The resulting ANN achieved an accuracy of 89.75 per cent, which was
not as good as LeCun’s convolutional neural network. However, that
wasn’t the point. They had proven that, by means of pretraining, a deep,
fully connected network could be trained. The road to deeper and more
effective networks was open.
Over the course of the next decade, deep learning gained mo-
mentum. The confluence of three advances enabled researchers
to build larger and deeper networks. Smarter algorithms reduced
computational complexity, faster computers reduced run-times, and
larger datasets allowed more parameters to be tuned.
In 2010, a team of researchers in Switzerland conducted an exper-
iment to see if increasing the depth of a neural network really did
translate into improved accuracy. Led by long-time neural network
guru Jürgen Schmidhuber, the group trained a six-layer neural network
to recognize digits. Their network contained a whopping 5,710 neurons.
They, like Hinton’s group, used the MNIST dataset of handwritten
digits. However, even MNIST wasn’t big enough for Schmidhuber’s
team’s purposes. They artificially generated additional digit images by
distorting the MNIST photographs.
The resulting ANN achieved an accuracy of 99.65 per cent. This
wasn’t just a world record, this was human-level performance.
Suddenly, it dawned on everyone that ANNs had been too small to
be of any practical use. Deep networks were the way to go. A revolution
in artificial intelligence was at hand.
The Tsunami
The deep learning tsunami hit in three waves: first, speech recognition,
then image recognition, next natural language processing. Half a
century of pattern recognition research was swept away in just
three years.
For sixty years, the tech community had struggled to accurately
convert spoken words to text. The best algorithms relied on the
Fourier transform (see Chapter 2) to extract the amplitude of
the harmonics. Hidden Markov Models (HMMs) were then used
to determine the phonemes uttered based on the observed har-
monic content and the known probability of sound sequences in
real speech.
With the help of Navdeep Jaitly, an intern from Hinton’s Lab, Google
ripped out half of their production speech recognition system and
replaced it with a deep neural network. The resulting hybrid ANN–
HMM speech recognition system contained a four-layer ANN. The team
trained the ANN with 5,870 hours of recorded speech sourced from
Google Voice Search, augmented with 1,400 hours of dialogue from
YouTube. The new ANN–HMM hybrid outperformed Google’s old
HMM-based speech recognition system by 4.7 per cent. In the context
of automatic speech recognition, this was a colossal advance. With
his mission at Google accomplished, Jaitly—intern extraordinaire—
returned to Toronto to finish his PhD.
Over the course of the next five years, Google progressively extended
and improved their ANN-based speech recognition system. By 2017,
Google’s speech recognition system had attained ninety-five per cent
accuracy—a previously unheard-of level of accuracy.
In 2012, Hinton’s group reported on a deep neural network designed
to recognize real-world objects in still images. The objects were every-
day items such as cats, dogs, people, faces, cars, and plants. The problem
was a far cry from merely recognizing digits. Digits are made up of lines,
but object identification requires analysis of shape, colour, texture, and
edges. On top of that, the number of object classes to be recognized
greatly exceeded the paltry ten Hindu–Arabic digits.
The network—dubbed AlexNet after lead designer Alex Krizhevsky—
contained 650,000 neurons and sixty million parameters. It incorporated
five convolutional layers followed by three fully connected layers.
In addition, the work introduced a simple, yet surprisingly effective,
technique. During training, a handful of neurons are selected at
random and silenced. In other words, they are prevented from firing.
Drop-out, as the technique was named, forces the network to spread the
decision-making load over more neurons. This has the effect of making
the network more robust to variations in the input.
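Drop-out can be sketched in a few lines. This is a simplified version: the silencing probability and the rescaling convention vary between implementations.

```python
import numpy as np

def dropout(activations, p, rng):
    # During training, silence each neuron with probability p.
    # Surviving activations are scaled up so the expected total
    # signal stays the same ("inverted dropout" convention).
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(42)
acts = np.ones(10_000)
dropped = dropout(acts, p=0.5, rng=rng)
print((dropped == 0).mean())  # roughly half the neurons silenced
```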
The team entered the network into the ImageNet Large Scale Vi-
sual Recognition Challenge in 2012. The dataset for the competition
consisted of approximately 1.2 million training images and 1,000 object
classes. Krizhevsky, Ilya Sutskever, and Hinton’s deep convolutional
network swept the boards. AlexNet achieved a top five accuracy of 84.7
per cent. That is to say, the true object class was among the ANN’s top
five picks more than 84 per cent of the time. The network’s error rate
was almost half that of the second placed system.
Meanwhile, just 500 km east along the St. Lawrence River from
Toronto, a team at the Université de Montréal was investigating how
deep neural networks could be applied to the processing of text. That
team was led by Yoshua Bengio (Figure 11.8).
Hailing from Paris, France (born 1964), Bengio was one of the lead-
ing lights of the neural network renaissance. He studied Electronic
Engineering and Computer Science at McGill University in Montreal,
obtaining BEng, MSc, and PhD degrees. A science fiction fan as an
adolescent, Bengio became passionate about neural network research
as a graduate student. He devoured all of the early papers on the topic.
A self-professed nerd, he set out to build his own ANN. After working
as a post-doctoral researcher at AT&T Bell Labs and MIT, Bengio joined
the Université de Montréal as a faculty member in 1993. Bengio’s team
trained ANNs to predict the probability of word sequences in text.
In 2014, Google picked up on Bengio’s work and adapted it to the
problem of translating documents from one language to another. By
then, the Google Translate web service had been in operation for eight
years. The system relied on conventional approaches to segment sen-
tences and map phrases from one language to another. On the whole,
Figure 11.8 Neural network researcher Yoshua Bengio, 2017. (© École polytech-
nique - J. Barande.)
12
Superhuman Intelligence
The Match
The AlphaGo-Lee match is a best-of-five games contest. An estimated
sixty million viewers are watching the game on television
in China alone. A hundred thousand enthusiasts are glued to the live
English language coverage on YouTube.
The DeepMind team spectates from a war room in the bowels of the
hotel. The room is kitted out with a wall of monitors. Some screens dis-
play camera feeds from the match room. Others show lists of numbers
and graphs summarizing AlphaGo’s analysis of the game. DeepMind
CEO Demis Hassabis and lead project researcher David Silver watch the
match unfold from this vantage point. Like the rest of their team,
Hassabis and Silver are anxious, but powerless.
Day one, game one. Lee places the first stone. Bizarrely, it takes
AlphaGo half a minute to respond. The AlphaGo team holds its breath.
Is the machine working at all? Finally, it makes its decision and Huang
places AlphaGo’s first stone.
AlphaGo attacks from the outset. Lee seems mildly surprised. Al-
phaGo isn’t playing like a computer at all. Then comes AlphaGo’s move
102. It is aggressive—a gateway to complicated skirmishes. Lee recoils,
rubbing the back of his neck. He looks worried. He strengthens his
resolve and rejoins battle. Eighty-four moves later, Lee resigns. The
reaction in the DeepMind team room is euphoric.
Afterwards, Lee and a composed Hassabis face the assembled media
at the post-game press conference. The two sit apart on stools on a bare
stage. Lee looks isolated, lost, abandoned. He is deeply disappointed but
accepts his loss with grace. The following morning, AlphaGo’s victory
is front page news.
Day two, game two. This time, Lee knows what to expect. He plays
more cautiously. On move 37, AlphaGo makes an unexpected play—
a move that humans seldom play. In shock, Lee walks out of the
conference room. Huang and the match judges stay put, bewildered.
Minutes later, having collected his thoughts, Lee returns to the fray.
After 211 moves, Lee again resigns.
AlphaGo’s move 37 was decisive. The computer estimated that the
chances of a human playing the move were one in ten thousand. The
European Go champion Fan Hui was awestruck. For him, move 37
was, ‘So beautiful. So beautiful.’ AlphaGo had displayed insight beyond
human expertise. The machine was creative.
At the press conference, Lee reflected on the game: 242
Yesterday, I was surprised. But today I am speechless. If you look at the
way the game was played, I admit, it was a very clear loss on my part.
From the very beginning of the game, there was not a moment in time
when I felt that I was leading.
Day three, game three. Lee’s facial expressions say it all—initial calm,
turning to concern, followed by agony, and finally dismay. He resigns
after four hours of play. Against all expectations—save those of Google
and DeepMind—AlphaGo wins the match.
Lee looks worn out. Regardless, he is gracious in defeat: 243
I apologize for being unable to satisfy a lot of people’s expectations.
I kind of felt powerless.
A strange kind of melancholy descends on proceedings. Everyone is
affected, even the DeepMind team. Those present are witnessing the
suffering of a great man. One of Lee’s rivals remarks that Lee had
fought: 238
A very lonely battle against an invisible opponent.
Even though the series is decided, Lee and AlphaGo press on, playing
games four and five. In game four, Lee is more himself. The Strong Stone
takes a high-risk strategy. His move 78—a so-called ‘wedge’ play—is
later referred to by commentators as ‘God’s move’. AlphaGo’s response
is disastrous for the machine. Soon its play becomes rudderless. Finding
no way out, the computer begins to make nonsense moves. Eventually,
AlphaGo resigns.
Lee tries the same high-risk approach in game five. This time, there is
no miracle play. Lee is forced to resign.
AlphaGo wins the match by 4 games to 1.
DeepMind
To outsiders, it must have seemed that DeepMind was an overnight
success but, of course, it wasn’t. Demis Hassabis, the company’s co-
founder and CEO, had been thinking about board games and computers
since he was a kid.
Hassabis (Figure 12.1) was born in London, England, in 1976. He is
proudly ‘North London born and bred’. 245 Hassabis reached master
level in Chess aged 13. He spent his winnings on his first computer—
a Sinclair Spectrum 48K—and taught himself to program. Before long,
he had completed his first Chess-playing program.
Hassabis finished high school at sixteen and joined a video game
development company (Lionhead Studios). A year later, he was co-
designer and lead-programmer on the popular management simula-
tion game Theme Park. Hassabis left the company to enrol in a Computer
Figure 12.1 DeepMind co-founder and CEO Demis Hassabis, 2018. (Courtesy
DeepMind.)
possible. At first, the network played randomly. Through trial and error
and a learning algorithm, it gradually accumulated a suite of point-
scoring tactics. By the end of training, DeepMind’s neural network was
better at Space Invaders than any previous algorithm. This, in itself, was
an achievement. What was remarkable was that the network went on
to learn how to play forty-nine different Atari video games. The games
were varied, requiring different skills. Not only could the network play
the games, it could play them just as well as a professional human games
tester. This was new. DeepMind’s ANN had excelled across a range
of tasks. For the first time, an ANN was showing a general-purpose
learning capability.
A year later—and just two months before the Lee match—
DeepMind published another paper in Nature. In it, they described
AlphaGo and casually mentioned that the program had beaten the
European Go champion Fan Hui. The paper should have been a
warning to Lee Sedol and others. However, Europe was regarded as
a Go backwater. It was presumed that Fan Hui had erred. Fan Hui, for
his part, was so impressed with AlphaGo that he accepted an offer to
act as a consultant to the DeepMind team as they prepared for the Lee
Sedol match.
AlphaGo’s resounding victory over Lee garnered headlines world-
wide. In contrast, AlphaGo’s later defeat of the world number one was
an anticlimax. AlphaGo beat nineteen-year-old Ke Jie 3–0 in May 2017.
This time, the match received little media coverage. The world seemed
to have accepted humankind’s defeat and moved on. After the win,
Hassabis said that, based on AlphaGo’s analysis, Jie had played almost
perfectly. Almost perfect was no longer good enough. After the match,
DeepMind retired AlphaGo from competitive play.
Yet the company didn’t stop working on computers that played Go. It
published another paper in Nature, describing a new neural network pro-
gram dubbed AlphaGo Zero. AlphaGo Zero employed a reduced tree
search and just one neural network. This single two-headed network
replaced the policy and value networks of its predecessor. AlphaGo
Zero used a new, more efficient training procedure based exclusively on
reinforcement learning. Gone was the need for a database of human
moves. AlphaGo Zero taught itself how to play Go from scratch in a
meagre forty days. In that time, it played twenty-nine million games.
The machine was allowed just five seconds of processing time between
moves. AlphaGo Zero was tested against the version of AlphaGo that
beat Ke Jie. AlphaGo Zero won 100 games to nil.
In just forty days the computer had taught itself to play Go better
than any human, ever. AlphaGo Zero was emphatically superhuman.
Human Go grandmasters pored over AlphaGo Zero’s moves. They
discovered that AlphaGo Zero employed previously unknown game-
winning strategies. Ke Jie began to include the new tactics in his own
repertoire. A new age in the history of Go was dawning. Human grand-
masters were now apprentices to the machine. Biological neural net-
works were learning from their artificial creations.
The true significance of AlphaGo Zero doesn’t lie on a Go board, how-
ever. Its real importance lies in the fact that AlphaGo Zero is a prototype
for a general-purpose problem solver. The algorithms embedded in its
software can be applied to other problems. This capability will allow
ANNs to rapidly take on new tasks and solve problems that they haven’t
seen before—something that heretofore only humans and high-level
mammals have achieved.
The first signs of this general problem-solving capacity appeared in
2018 in yet another Nature paper. This time, the DeepMind team trained
an ANN named AlphaZero to play Go, Chess, and Shogi (Japanese
Chess). It wasn’t particularly surprising to read that AlphaZero learned
how to play the three games solely from self-play. Neither was it es-
pecially eyebrow raising to note that AlphaZero defeated the previ-
ous world-champion programs (Stockfish, Elmo, AlphaGo Zero) at all
three games. What was jaw dropping was that, starting from random
play, AlphaZero learned to play Chess in just nine hours, Shogi in
twelve hours, and Go in thirteen days. The human mind was beginning
to appear weak in comparison.
13
Next Steps
Cryptocurrency
The first of these is the algorithm that underpins cryptocurrency. Cryp-
tocurrency is a form of money that only exists as information held in
a computer network. The world’s first cryptocurrency, Bitcoin, now
has over seventeen million ‘coins’ in circulation with a total real-world
value of $200 billion (2019). Cryptocurrencies seem set to disrupt the
global financial system.
The origins of cryptocurrency lie in the Cypherpunk movement that
began in the 1990s. The Cypherpunks are a loose amalgam of skilled
cryptographers, mathematicians, programmers, and hackers who be-
lieve passionately in the need for electronic privacy. Connected by
mailing lists and online discussion groups, the Cypherpunks develop
open-source software that users can deploy for free to secure their data
Bitcoin
Satoshi Nakamoto announced a solution to the Double-Spend Problem
on 31 October 2008 in a white paper posted to a Cypherpunk mailing
list. The paper introduced Bitcoin, the world’s first practical cryptocur-
rency. The following January, Nakamoto released the Bitcoin source
code and the original—or genesis—Bitcoin block.
Fundamentally, bitcoins are just sequences of characters (numbers
and letters) held in a computer network. A bitcoin only has value
because people believe that it has value. Users expect that they will be
able to exchange bitcoins for goods and services at a later date. In this,
Bitcoin is no different from the banknotes in your pocket. The paper
itself has little intrinsic value. Its value derives from the expectation that
you will be able to exchange it for something of worth.
Bitcoin is reasonably straightforward to use. Users buy, sell, and
exchange bitcoins via apps. Bitcoins can be used to purchase real-world
Blockchain
Next, the transaction is confirmed and logged. The network computers
incorporate the new transaction in a larger block of unconfirmed trans-
actions. A block is simply a group of unconfirmed transactions and their
associated data. The network computers race to add their blocks to the
ledger. The winner of the race incorporates its block in the chain of
blocks. This chain of blocks—or Blockchain—is the ledger (Figure 13.3).
It links every confirmed Bitcoin block in an unbroken sequence stretch-
ing all the way back to Nakamoto’s genesis block. The links of the chain
are formed by including the ID of the previous block in the next block.
The chain rigidly defines the order in which transactions are applied
to the ledger. The transactions in a single block are considered to have
happened at the same time. Transactions in any previous block are
Who Is Nakamoto?
The really curious thing about Bitcoin is that no one knows who Satoshi
Nakamoto—Bitcoin’s inventor—is. The first mention of Nakamoto
was the release of the original Bitcoin white paper. Nakamoto remained
active on the Cypherpunk mailing lists for a few years, then in 2010
Nakamoto passed control of the Bitcoin source code to Gavin Andresen.
The following April, Nakamoto declared: 260
I’ve moved on to other things.
It’s in good hands with Gavin and everyone.
Save for a handful of messages—most of which are now thought to be
hoaxes—that was the last anyone heard from Nakamoto.
Quantum Computers
Bitcoin draws heavily on RSA public key cryptography to ensure user
anonymity and provide transaction authentication. In turn, the secu-
rity of the RSA algorithm hinges on the assumption that there is no fast
algorithm for prime factorization of large numbers (see Chapter 7). In
other words, there is no fast method for determining which two prime
numbers were multiplied to produce a given large number. Clearly, the
prime factors of 21 are 3 and 7, but this determination is only quick
because 21 is small. Prime factorization of a large number can take decades
on a supercomputer.
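The asymmetry is easy to see with a naive factoring routine. Trial division works fine for 21, but is hopeless for the hundreds-of-digits numbers used in RSA:

```python
def prime_factor_pair(n):
    # Trial division: try every candidate divisor up to sqrt(n).
    # The work grows rapidly with the size of n, which is why
    # factoring RSA-sized numbers this way is impractical.
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # no divisor found: n is prime

print(prime_factor_pair(21))  # (3, 7)
```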
moment in time, the voltage level on each wire has a single value and
so represents a single binary digit (or bit). Therefore, calculations have
to be performed one after another.
In contrast, quantum computers represent information using the
properties of subatomic, or quantum, particles. Various physical prop-
erties of subparticles can be used. One option is the spin of an electron.
An upward spin might represent a 1, downward a 0. The big advantage
of using the properties of subatomic particles is that, in the quantum
world, particles can exist in multiple states at the same time. This
strange behaviour is encapsulated in the principle of superposition. The effect
was uncovered by physicists in the early part of the twentieth century.
An electron can spin with all possible orientations simultaneously.
Exploiting this effect to represent data means that a single electron can
represent 0 and 1 simultaneously. This phenomenon gives rise to the
basic unit of information in a quantum computer—the quantum bit,
or qubit.
Quantum computers become exponentially more powerful as
qubits are added. A single qubit can represent two values—0 and
1—simultaneously. Two qubits allow four values—00, 01, 10, and 11—
to be represented at the same time. A ten-qubit system can capture
all decimal values from 0 to 1,023, inclusive, simultaneously. When a
quantum computer performs an operation, it is applied to all states
at the same time. For example, adding one to a ten-qubit system
performs 1,024 additions at once. On a conventional computer, these
1,024 additions would have to be performed one after another. This
effect bestows on quantum computers the potential for exponential
acceleration in calculation.
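The state-counting argument can be illustrated with a toy amplitude-vector simulation. Note that this is a classical simulation, whose cost grows exponentially with the number of qubits; avoiding that cost is exactly the point of quantum hardware:

```python
import numpy as np

n_qubits = 10
n_states = 2 ** n_qubits          # 1,024 basis states for ten qubits

# A uniform superposition: equal amplitude on every basis state.
amplitudes = np.full(n_states, 1 / np.sqrt(n_states))

# "Add one" applied to the register maps every basis state at once:
# state k becomes state k+1 (mod 1,024). Classically this is 1,024
# separate updates; on a quantum computer it is a single operation.
after_add = np.roll(amplitudes, 1)

print(n_states)                        # 1024
print(np.sum(np.abs(after_add) ** 2))  # total probability stays 1.0
```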
There is a snag, though. Measuring the value of a qubit collapses its
state. This means that the qubit settles to a single value when its physical
state is measured. Thus, even though a ten-qubit system can perform
1,024 additions simultaneously, only one result can be retrieved. Worse
again, the output retrieved is selected at random from the 1,024 pos-
sibilities. The collapsed result of the addition of 1 could be any value
from 1 to 1,024. Clearly, random selection of the output is not desirable.
Mostly, we want to input some data and recover a particular result. The
solution to this problem is an effect known as interference. It is sometimes
possible to force unwanted states to destructively interfere with one
another. In this way, the unwanted results can be removed, leaving the
single, desired outcome behind.
produces the prime factors in ten or fewer iterations (see the Appendix
for more details).
On a conventional computer, multiplying the guess by itself over and
over again is very slow. The loop has to be repeated a large number of
times before any pattern appears. On a quantum computer, these mul-
tiplications can be performed simultaneously, thanks to superposition.
After that, a quantum Fourier transform can be used to cancel all but
the strongest repeating pattern. This gives the period of the remainder
sequence, which can be collapsed and measured. Euclid’s algorithm
is then performed on a conventional computer. Superposition and
interference allow the quantum computer to perform Shor’s algorithm
amazingly quickly.
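The classical skeleton of the procedure can be sketched as follows. Here the period is found by brute force; the quantum computer's job is to perform precisely this step exponentially faster:

```python
from math import gcd

def shor_classical(n, guess):
    # Find the period r of guess^k mod n by brute force (the step a
    # quantum computer performs via superposition and the quantum
    # Fourier transform), then recover factors with Euclid's algorithm.
    value, r = guess, 1
    while value != 1:
        value = (value * guess) % n
        r += 1
    if r % 2 != 0:
        return None  # odd period: try a different guess
    half = pow(guess, r // 2, n)  # guess to the power of half the period
    return sorted((gcd(half - 1, n), gcd(half + 1, n)))

print(shor_classical(21, 2))  # [3, 7]
```

For n = 21 and guess 2, the remainders cycle with period 6, and Euclid's algorithm applied to 2³ ± 1 recovers the factors 3 and 7.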
Teams at Google, IBM, Microsoft, and a handful of start-ups are now
chasing the quantum computing dream. Their devices bear greater
resemblance to large physics experiments than supercomputers. Build-
ing a quantum computer requires design and subatomic fabrication
of quantum logic gates. Measuring and controlling the state of the
subatomic particles requires incredibly precise equipment. To perform
reliable measurements, the qubits must be cooled to near absolute zero
(–273°C).
To date, computation with up to seventy-two qubits has been
demonstrated. In theory, seventy-two qubits should provide immense
computing power. However, in practice, quantum noise affects perfor-
mance. Minute fluctuations in the state of the subatomic particles can
lead to errors in computation. Teams compensate for this by dedicating
some of the qubits to error correction (see Chapter 7). The downside
is that fewer qubits are available for computation. On the face of it,
the solution appears to be straightforward—simply add more qubits.
However, there is a worry. What if more qubits just mean more noise
and errors? What if none are available for computation?
In October 2019, a team from Google claimed that their quantum
computer had attained quantum supremacy. The group stated that the
computer had performed a computation that could not conceivably
be completed on a conventional computer. The program checked that
the output of a quantum random number generator was truly random.
Their Sycamore quantum computing chip completed the task in 200
seconds using fifty-three qubits. The team estimated that the same
calculation would take more than 10,000 years on a supercomputer.
IBM begged to differ. They calculated that the task could be performed
Appendix
PageRank Algorithm
Take the table of link counts as input.
Calculate the PageRanks as the number of incoming links for a page
divided by the average number of incoming links.
Repeat the following:
Repeat the following for every column:
Set a running total to zero.
Repeat the following for every entry in the column:
Look up the current PageRank for the row.
Multiply by the number of links between the row and
column.
Divide by the total number of outgoing links for the row.
Multiply by the damping factor.
Add to the running total.
Stop repeating when all entries in the column have been
processed.
Add the damping term to the running total.
Store this value as the new PageRank for the column.
Stop repeating when all columns have been processed.
Stop repeating when the change in the PageRanks is small.
Output the PageRanks.
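The iteration above can be sketched in Python. This is an illustrative simplification, not code from the book: it assumes at most one link between any pair of pages and starts from a uniform guess rather than the link-count-based one.

```python
def pagerank(links, damping=0.85, tolerance=1e-9):
    """Iterative PageRank. links[i] lists the pages that page i links
    to; every page is assumed to have at least one outgoing link."""
    n = len(links)
    ranks = [1.0 / n] * n                      # uniform initial guess
    while True:
        new_ranks = []
        for col in range(n):                   # for every column (page)
            total = 0.0                        # running total
            for row in range(n):               # every entry in the column
                if col in links[row]:          # row links to col
                    # current rank, spread over outgoing links, damped
                    total += damping * ranks[row] / len(links[row])
            total += (1 - damping) / n         # damping term
            new_ranks.append(total)
        # stop when the change in the PageRanks is small
        if max(abs(a - b) for a, b in zip(new_ranks, ranks)) < tolerance:
            return new_ranks
        ranks = new_ranks

# Three pages linking in a cycle all end up with equal rank.
ranks = pagerank([[1], [2], [0]])
```

For the three-page cycle each rank converges to 1/3; the (1 − damping)/n term guarantees that every page keeps some minimum rank, however few links point to it.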
Bitcoin Algorithm
The bitcoin sender:
Creates a transaction recording the sender’s public key, the
receiver’s public key, the amount, and the IDs of the inputs to
the transaction.
Appends a digital signature to the transaction.
Broadcasts the signed transaction to the Bitcoin network.
The computers on the Bitcoin network:
Check that the signature is authentic.
Check that the input transactions have not been spent.
Incorporate the transaction in a candidate block.
Link the candidate to the chain.
Repeat the following steps:
Generate a random number and append it to the block.
Calculate the hash for the block.
Stop repeating when the hash is less than the threshold or
abandon the hunt when another computer wins the race.
Broadcast the valid block to the network.
The bitcoin receiver:
Accepts the transaction when it and five more blocks have been
added to the chain.
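The proof-of-work race at the heart of the procedure can be sketched in Python. This is a toy, not the real protocol: Bitcoin hashes an 80-byte binary block header twice with SHA-256, and its threshold is astronomically lower than the one used here.

```python
import hashlib
import random

def mine(block_data, threshold):
    """Append random numbers (nonces) to the block and hash, until the
    hash, read as an integer, falls below the threshold. The lower the
    threshold, the more attempts are needed on average."""
    while True:
        nonce = random.getrandbits(32)
        attempt = f"{block_data}:{nonce}".encode()
        digest = hashlib.sha256(attempt).hexdigest()
        if int(digest, 16) < threshold:
            return nonce, digest

# With this threshold roughly one hash in sixteen succeeds.
nonce, digest = mine("toy block", 2 ** 252)
```

Lowering the threshold is how the network tunes difficulty: each extra factor of two roughly doubles the expected number of hashes a miner must try.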
Shor’s Algorithm
Take a large number as input.
Repeat the following steps:
Take a prime number as a guess.
Store the guess in memory.
Create an empty list.
Repeat the following steps:
Multiply the value in memory by the guess.
Update the value in memory.
Calculate the remainder after dividing the value in memory by the
input.
Append this remainder to the list.
Stop repeating after a large number of repetitions.
Apply the Fourier transform to the list of remainders.
Identify the period of the strongest harmonic.
Calculate the guess to the power of the period divided by two, all
minus one.
Apply Euclid’s algorithm to this value and the input.
Stop repeating when the value returned is a prime factor of the input.
Divide the input by the prime factor.
Output both prime factors.
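The arithmetic skeleton of the procedure can be checked in ordinary Python. In this sketch the period is found by brute-force search rather than by the Fourier-transform step; finding the period is precisely the part that a quantum computer performs exponentially faster.

```python
from math import gcd

def find_period(guess, n):
    """Smallest r such that guess**r leaves remainder 1 when divided
    by n, found by repeated multiplication (the brute-force stand-in
    for the quantum period-finding step)."""
    value = guess % n
    r = 1
    while value != 1:
        value = (value * guess) % n
        r += 1
    return r

def shor_classical(n, guess):
    """Derive a factor of n from the period of the guess, following
    the final steps above. Returns None for an unlucky guess."""
    common = gcd(guess, n)
    if common != 1:
        return common                  # lucky: guess shares a factor with n
    r = find_period(guess, n)
    if r % 2 != 0:
        return None                    # odd period: try another guess
    f = gcd(guess ** (r // 2) - 1, n)  # Euclid's algorithm on the value
    return f if 1 < f < n else None

factor = shor_classical(15, 7)  # period of 7 mod 15 is 4; factor is 3
```

Dividing the input by the returned factor then yields the second factor (15 ÷ 3 = 5).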
Notes
Introduction
(Page 1) Strictly speaking, even addition is an algorithm.
(Page 1) A common misconception is that the word ‘algorithm’ is synonymous
with ‘method’. The two words are not equivalent. A method is a series of steps.
An algorithm is a series of steps that solve an information problem.
(Page 3) In this example, the books are considered as symbols representing the
titles of the books. Rearranging the books—the symbols—has the effect of
sorting the titles.
(Page 28) Archimedes did not have the benefit of the sine, cosine, and tangent
trigonometric functions that we use today. The length of a side of the inner
hexagon is 2r sin(π/6). The angle is measured from the centre to the bisection of
the side. The length of a side of the outer hexagon is 2r tan(π/6).
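These side lengths give Archimedes’ lower and upper bounds on π: dividing the perimeters of the inscribed and circumscribed n-gons by the diameter brackets the ratio. A quick numerical check, using the modern functions Archimedes lacked:

```python
from math import sin, tan, pi

def pi_bounds(n):
    """Perimeter of a regular n-gon inscribed in (respectively
    circumscribed about) a unit circle, divided by the diameter 2.
    The pair brackets the true value of pi."""
    return n * sin(pi / n), n * tan(pi / n)

lower6, upper6 = pi_bounds(6)     # hexagons: 3.0 and about 3.4641
lower96, upper96 = pi_bounds(96)  # Archimedes' 96-sided polygons,
                                  # matching his 3 10/71 < pi < 3 1/7
```

Doubling the number of sides tightens the squeeze, which is exactly how Archimedes proceeded from hexagons to 96-sided polygons.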
(Page 29) Archimedes’ algorithm was finally supplanted by calculations based
on infinite series.
(Page 30) Quadratic equations are of the form ax² + bx + c = 0, where a, b,
and c are known constants, or coefficients, and x is the unknown value to be
determined.
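Al-Khwārizmī solved such equations by completing the square; in modern notation the solutions are given by the quadratic formula. A sketch for the case of two real roots:

```python
from math import sqrt

def solve_quadratic(a, b, c):
    """Roots of a*x**2 + b*x + c = 0 by the quadratic formula
    x = (-b ± sqrt(b**2 - 4*a*c)) / (2*a). Assumes the discriminant
    b**2 - 4*a*c is non-negative, i.e. the roots are real."""
    discriminant = b * b - 4 * a * c
    root = sqrt(discriminant)
    return (-b + root) / (2 * a), (-b - root) / (2 * a)

solve_quadratic(1, -3, 2)  # x**2 - 3*x + 2 = 0 has roots 2.0 and 1.0
```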
(Page 30) The Compendious Book on Calculation by Completion and Balancing was translated
into Latin by Robert of Chester around 1145.
(Page 31) Other cultures developed decimal number systems, including the
Chinese and Egyptians. However, they used different numerals (digit represen-
tations) and, by and large, used alternative positional systems.
(Page 33) More precisely, Fourier claimed that any function of a variable could
be expressed as the summation of a series of sinusoidal functions whose periods
are whole-number divisors of the period of the original function. The Fourier
series had been previously used by Leonhard Euler, Joseph Louis Lagrange, and
Carl Friedrich Gauss. However, Fourier’s work served to popularize the concept
and was the basis of later work.
(Page 35) In the Fourier transform example, I omit the DC (constant) compo-
nent for simplicity.
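A discrete Fourier transform computed directly from its defining sum makes the decomposition concrete (a sketch; bin 0 holds the DC component mentioned above):

```python
import cmath
import math

def dft(signal):
    """Discrete Fourier transform by direct summation. Output bin k
    measures how strongly a sinusoid completing k cycles over the
    window is present in the signal; bin 0 is the DC (constant) term."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]

# A cosine with two cycles over eight samples: all the energy lands in
# bin 2 and its mirror image, bin 6.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
magnitudes = [round(abs(c), 6) for c in dft(signal)]
```

This direct sum costs n² multiplications; the fast Fourier transform discussed in the main text reorganizes it to cost roughly n log n.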
(Page 37) Tukey also has the distinction of coining the terms ‘software’ and ‘bit’.
(Page 45) A section of the Analytic Engine was assembled and is now in the
Science Museum, London.
(Page 45) Macabrely, half of Babbage’s brain is on display in the Science Museum,
London. The other half is held in the Royal College of Surgeons. Menabrea,
author of the original paper on the Analytic Engine, went on to become Prime
Minister of Italy (1867–1869).
(Page 50) Turing’s supervisor at Princeton, Alonzo Church, came up with an
alternative, calculus-based proof at roughly the same time. Turing’s proposal
was closely related to earlier work by Kurt Gödel.
(Page 52) Turing’s original description of the Turing Test, rather oddly, equates
differentiating between a computer and human with differentiating between
a man and woman. One wonders if there was a subtext regarding his own
homosexuality.
(Page 53) It has been reported that the Apple logo was emblematic of the apple
found by Turing’s bedside. When asked, Steve Jobs replied that it wasn’t, but he
wished it had been.
(Page 54) A hack to make the Z3 Turing Complete was published in 1998.
(Page 54) Under the direction of George Stibitz, Bell Labs also developed a relay-
based calculator.
US were awarded on the same day—7 June 1965. The recipients were Sister
Mary Kenneth Keller at the University of Wisconsin and Irving Tang at Washington
University in St. Louis.
(Page 111) Fisher dedicated his book to Darwin’s son, Leonard Darwin, with
whom Fisher had a long friendship and who provided much support in the
writing of the book.
(Page 113) Paradoxically, Holland used the success of natural evolution to
justify his work on genetic algorithms, whereas biologists employed Holland’s
algorithms to support their arguments for the existence of natural evolution.
Permissions
Bibliography
1. Hoare, C.A.R., 1962. Quicksort. The Computer Journal, 5(1), pp. 10–16.
2. Dalley, S., 1989. Myths from Mesopotamia. Oxford: Oxford University Press.
3. Finkel, I., 2014. The Ark before Noah. Hachette.
4. Rawlinson, H.C., 1846. The Persian cuneiform inscription at Behistun,
decyphered and translated. Journal of the Royal Asiatic Society of Great Britain
and Ireland, 10, pp. i–349.
5. Knuth, D.E., 1972. Ancient Babylonian algorithms. Communications of the
ACM, 15(7), pp. 671–7.
6. Fowler, D. and Robson, E., 1998. Square root approximations in old
Babylonian mathematics: YBC 7289 in context. Historia Mathematica, 25(4),
pp. 366–78.
7. Fee, G.J., 1996. The square root of 2 to 10 million digits. http://www.plouffe.fr/simon/constants/sqrt2.txt. (Accessed 5 July 2019).
8. Harper, R.F., 1904. The Code of Hammurabi, King of Babylon. Chicago: The
University of Chicago Press.
9. Jaynes, J., 1976. The Origin of Consciousness in the Breakdown of the Bicameral Mind.
New York: Houghton Mifflin Harcourt.
10. Boyer, C.B. and Merzbach. U.C., 2011. A History of Mathematics. Oxford: John
Wiley & Sons.
11. Davis, W.S., 1913. Readings in Ancient History, Illustrative Extracts from the Source:
Greece and the East. New York: Allyn and Bacon.
12. Beckmann, P., 1971. A history of Pi. Boulder, CO: The Golem Press.
13. Mackay, J.S., 1884. Mnemonics for π, 1/π, e. Proceedings of the Edinburgh Mathematical Society, 3, pp. 103–7.
14. Dietrich, L., Dietrich, O., and Notroff, J., 2017. Cult as a driving force of
human history. Expedition Magazine, 59(3), pp. 10–25.
15. Katz, V., 2008. A History of Mathematics. London: Pearson.
16. Katz, V. J. ed., 2007. The Mathematics of Egypt, Mesopotamia, China, India, and Islam.
Princeton, NJ: Princeton University Press.
17. Palmer, J., 2010. Pi record smashed as team finds two-quadrillionth digit – BBC News [online]. https://www.bbc.com/news/technology-11313194, September 16 2010. (Accessed 6 January 2019).
18. The Editors of Encyclopaedia Britannica, 2020. Al-Khwārizmī. In Encyclopaedia Britannica [online]. https://www.britannica.com/biography/al-Khwarizmi (Accessed 20 May 2020).
19. LeVeque, W.J. and Smith, D.E., 2019. Numerals and numeral systems. In Encyclopaedia Britannica [online]. https://www.britannica.com/science/numeral. (Accessed 19 May 2020).
20. The Editors of Encyclopaedia Britannica, 2020. French Revolution. In Encyclopaedia Britannica [online]. https://www.britannica.com/event/French-Revolution. (Accessed 19 May 2020).
21. Cooley, J.W. and Tukey, J.W., 1965. An algorithm for the machine
calculation of complex Fourier series. Mathematics of Computation, 19(90),
pp. 297–301.
22. Rockmore, D.N., 2000. The FFT: An algorithm the whole family can use.
Computing in Science & Engineering, 2(1), pp. 60–4.
23. Anonymous, 2016. James William Cooley. New York Times.
24. Heideman, M.T., Johnson, D.H., and Burrus, C.S., 1984. Gauss and the history of the fast Fourier transform. IEEE ASSP Magazine, 1(4), pp. 14–21.
25. Huxley, T.H., 1887. The Advance of Science in the Last Half-Century. New York:
Appleton and Company.
26. Swade, D., 2000. The Cogwheel Brain. London: Little, Brown.
27. Babbage, C., 2011. Passages from the Life of a Philosopher. Cambridge: Cambridge
University Press.
28. Menabrea, L.F. and King, A., Countess of Lovelace, 1843. Sketch of
the analytical engine invented by Charles Babbage. Scientific Memoirs, 3,
pp. 666–731.
29. Essinger, J., 2014. Ada’s algorithm: How Lord Byron’s daughter Ada Lovelace Launched
the Digital Age. London: Melville House.
30. Kim, E.E. and Toole, B.A., 1999. Ada and the first computer. Scientific
American, 280(5), pp. 76–81.
31. Isaacson, W., 2014. The Innovators. New York: Simon and Schuster.
32. Turing, S., 1959. Alan M. Turing. Cambridge: W. Heffer & Sons, Ltd.
33. Turing, A.M., 1937. On computable numbers, with an application to the
Entscheidungsproblem. Proceedings of the London Mathematical Society, s2–42(1),
pp. 230–65.
34. Davis, M., 1983. Computability and Unsolvability. Mineola, NY: Dover Publica-
tions.
35. Strachey, C., 1965. An impossible program. The Computer Journal, 7(4), p. 313.
36. Turing, A.M., 1950. Computing machinery and intelligence. Mind, 59(236),
pp. 433–60.
37. Copeland, B. Jack., 2014. Turing. Oxford: Oxford University Press.
38. Abbe, C., 1901. The physical basis of long-range weather forecasts. Monthly
Weather Review, 29(12), pp. 551–61.
39. Lynch, P., 2008. The origins of computer weather prediction and climate
modeling. Journal of Computational Physics, 227(7), pp. 3431–44.
40. Hunt, J.C.R., 1998. Lewis Fry Richardson and his contributions to mathe-
matics, meteorology, and models of conflict. Annual Review of Fluid Mechanics,
30(1), pp. xiii–xxxvi.
41. Mauchly, J.W., 1982. The use of high speed vacuum tube devices for cal-
culating. In: B. Randall, ed., The Origins of Digital Computers. Berlin: Springer,
pp. 329–33.
42. Fritz, W.B., 1996. The women of ENIAC. IEEE Annals of the History of Comput-
ing, 18(3), pp. 13–28.
43. Ulam, S., 1958. John von Neumann 1903–1957. Bulletin of the American Math-
ematical Society, 64(3), pp. 1–49.
44. Poundstone, W., 1992. Prisoner’s Dilemma. New York: Doubleday.
45. McCorduck, P., 2004. Machines Who Think. Natick, MA: AK Peters.
46. Goldstine, H.H., 1980. The Computer from Pascal to von Neumann. Princeton, NJ: Princeton University Press.
47. Stern, N., 1977. An Interview with J. Presper Eckert. Charles Babbage Institute,
University of Minnesota.
48. Von Neumann, J., 1993. First draft of a report on the EDVAC. IEEE Annals
of the History of Computing, 15(4), pp. 27–75.
49. Augarten, S., 1984. A. W. Burks, ‘Who invented the general-purpose
electronic computer?’ In Bit by bit: An Illustrated History of Computers. New
York: Ticknor & Fields. Epigraph, Ch. 4.
50. Kleiman, K., 2014. The computers: The remarkable story of the ENIAC programmers. Vimeo [online]. https://vimeo.com/ondemand/eniac6. (Accessed 11 March 2019).
51. Martin, C.D., 1995. ENIAC: Press conference that shook the world. IEEE
Technology and Society Magazine, 14(4), pp. 3–10.
52. Metropolis, N., 1987. The beginning of the Monte Carlo method. Los Alamos Science, 15, pp. 125–30.
53. Eckhardt, R., 1987. Stan Ulam, John von Neumann, and the Monte Carlo method. Los Alamos Science, 15, pp. 131–36.
54. Wolter, J., 2013. Experimental analysis of Canfield solitaire. http://politaire.com/article/canfield.html. (Accessed 20 May 2020).
55. Metropolis, N. and Ulam, S., 1949. The Monte Carlo method. Journal of the
American Statistical Association, 44(247), pp. 335–341.
56. Charney, J.G. and Eliassen, A., 1949. A numerical method for predicting
the perturbations of the middle latitude westerlies. Tellus, 1(2), pp. 38–54.
57. Charney, J.G., 1949. On a physical basis for numerical prediction of large-
scale motions in the atmosphere. Journal of Meteorology, 6(6), pp. 372–85.
58. Platzman, G.W., 1979. The ENIAC computations of 1950: Gateway to
numerical weather prediction. Bulletin of the American Meteorological Society,
60(4), pp. 302–12.
59. Charney, J.G., Fjörtoft, R., and von Neumann, J., 1950. Numerical inte-
gration of the barotropic vorticity equation. Tellus, 2(4), pp. 237–54.
60. Blair, C., 1957. Passing of a great mind. Life Magazine, 42(8), pp. 89–104.
61. Lorenz, E.N., 1995. The Essence of Chaos. Seattle: University of Washington
Press.
62. Lorenz, E.N., 1963. Deterministic nonperiodic flow. Journal of the Atmospheric
Sciences, 20(2), pp. 130–41.
63. Epstein, E.S., 1969. Stochastic dynamic prediction. Tellus, 21(6), pp. 739–59.
64. European Centre for Medium-Range Weather Forecasts, 2020. Advancing global NWP through international collaboration. http://www.ecmwf.int. (Accessed 19 May 2020).
65. Lynch, P. and Lynch, O., 2008. Forecasts by PHONIAC. Weather, 63(11),
pp. 324–6.
66. Shannon, C.E., 1950. Programming a computer for playing chess. Philosoph-
ical Magazine, 41(314), pp. 256–75.
67. National Physical Laboratory, 2012. Piloting Computing: Alan Turing’s Automatic Computing Engine. YouTube [online]. https://www.youtube.com/watch?v=cEQ6cnwaY_s. (Accessed 27 October 2019).
68. Campbell-Kelly, M., 1985. Christopher Strachey, 1916–1975: A biographi-
cal note. Annals of the History of Computing, 7(1), pp. 19–42.
69. Copeland, J. and Long, J., 2016. Restoring the first recording of computer music. https://blogs.bl.uk/sound-and-vision/2016/09/restoring-the-first-recording-of-computer-music.html. (Accessed 15 February 2019).
70. Foy, N., 1974. The word games of the night bird (interview with Christo-
pher Strachey). Computing Europe, 15, pp. 10–11.
71. Roberts, S., 2017. Christopher Strachey’s nineteen-fifties love machine.
The New Yorker, February 14.
72. Strachey, C., 1954. The ‘thinking’ machine. Encounter, III, October.
73. Strachey, C.S., 1952. Logical or non-mathematical programmes. In Proceed-
ings of the 1952 ACM National Meeting New York: ACM. pp. 46–9.
74. McCarthy, J., Minsky, M.L., Rochester, N., and Shannon, C.E., 2006. A
proposal for the Dartmouth summer research project on artificial intelli-
gence, August 31, 1955. AI Magazine, 27(4), pp. 12–14.
75. Newell, A. and Simon, H., 1956. The logic theory machine: A complex
information processing system. IRE Transactions on Information Theory. 2(3),
pp. 61–79.
76. Newell, A., Shaw, J.C., and Simon, H.A., 1959. Report on a general prob-
lem solving program. In Proceedings of the International Conference on Information
Processing. Paris: UNESCO. pp. 256–64.
77. Newell, A. and Simon, H., 1972. Human Problem Solving. New York: Prentice-
Hall.
78. Schaeffer, J., 2008. One Jump Ahead: Computer Perfection at Checkers. New York:
Springer.
79. Samuel, A.L., 1959. Some studies in machine learning using the game of
checkers. IBM Journal of Research and Development, 3(3), pp. 210–29.
80. McCarthy, J. and Feigenbaum, E.A., 1990. In memoriam: Arthur Samuel:
Pioneer in machine learning. AI Magazine, 11(3), p. 10.
81. Samuel, A.L., 1967. Some studies in machine learning using the game of
checkers. ii. IBM Journal of Research and Development, 11(6), pp. 601–17.
82. Madrigal, A.C., 2017. How checkers was solved. The Atlantic. July 19.
83. Simon, H.A., 1998. Allen Newell: 1927–1992. IEEE Annals of the History of
Computing, 20(2), pp. 63–76.
84. CBS, 1961. The thinking machine. YouTube [online]. https://youtu.be/aygSMgK3BEM. (Accessed 19 May 2020).
85. Dreyfus, H.L., 2005. Overcoming the myth of the mental: How philoso-
phers can profit from the phenomenology of everyday expertise. In:
Proceedings and Addresses of the American Philosophical Association, 79(2), pp. 47–65.
86. Nilsson, N.J., 2009. The Quest for Artificial Intelligence. Cambridge: Cambridge
University Press.
87. Schrijver, A., 2005. On the history of combinatorial optimization (till
1960). In: K. Aardal, G.L. Nemhauser, R. Weismantel, eds., Discrete optimiza-
tion, vol. 12. Amsterdam: Elsevier. pp. 1–68.
88. Dantzig, G., Fulkerson, R., and Johnson, S., 1954. Solution of a large-scale
traveling-salesman problem. Journal of the Operations Research Society of America,
2(4), pp. 393–410.
89. Cook, W., n.d. Traveling salesman problem. http://www.math.uwaterloo.ca/tsp/index.html. (Accessed 19 May 2020).
90. Cook, S.A., 1971. The complexity of theorem-proving procedures. In:
Proceedings of the 3rd annual ACM Symposium on Theory of Computing. New York:
ACM. pp. 151–8.
91. Karp, R., n.d. A personal view of Computer Science at Berkeley. https://www2.eecs.berkeley.edu/bears/CS_Anniversary/karp-talk.html. (Accessed 15 February 2019).
92. Garey, M.R. and Johnson, D.S., 1979. Computers and Intractability. New York:
W. H. Freeman and Company.
93. Dijkstra, E.W., 1972. The humble programmer. Communications of the ACM,
15(10), pp. 859–66.
94. Dijkstra, E.W., 2001. Oral history interview with Edsger W. Dijkstra. Tech-
nical report, Charles Babbage Institute, August 2.
95. Dijkstra, E.W., 1959. A note on two problems in connexion with graphs.
Numerische mathematik, 1(1), pp. 269–71.
96. Darrach, B., 1970. Meet Shaky: The first electronic person. Life Magazine, 69(21), pp. 58B–68B.
97. Hart, P.E., Nilsson, N.J., and Raphael, B., 1968. A formal basis for the
heuristic determination of minimum cost paths. IEEE Transactions on Systems
Science and Cybernetics, 4(2), pp. 100–7.
98. Hitsch, G.J., Hortaçsu, A., and Ariely, D., 2010. Matching and sorting in
online dating. The American Economic Review, 100(1), pp. 130–63.
99. NRMP. National resident matching program. http://www.nrmp.org. (Accessed 19 May 2020).
100. Roth, A.E., 2003. The origins, history, and design of the resident match.
Journal of the American Medical Association, 289(7), pp. 909–12.
101. Roth, A.E., 1984. The evolution of the labor market for medical interns
and residents: A case study in game theory. The Journal of Political Economy,
92, pp. 991–1016.
102. Anonymous, 2012. Stable matching: Theory, evidence, and practical de-
sign. Technical report, The Royal Swedish Academy of Sciences.
103. Kelly, K., 1994. Out of Control. London: Fourth Estate.
104. Vasbinder, J.W., 2014. Aha... That is Interesting!: John H. Holland, 85 years young.
Singapore: World Scientific.
105. London, R.L., 2013. Who earned first computer science Ph.D.? Communica-
tions of the ACM : blog@CACM, January.
106. Scott, N.R., 1996. The early years through the 1960’s: Computing at the
CSE@50. Technical report, University of Michigan.
107. Fisher, R.A., 1999. The Genetical Theory of Natural Selection. Oxford: Oxford
University Press.
108. Holland, J.H., 1992. Adaptation in Natural and Artificial Systems. Cambridge, MA:
The MIT Press.
109. Holland, J.H., 1992. Genetic algorithms. Scientific American, 267(1),
pp. 66–73.
110. Dawkins, R., 1986. The Blind Watchmaker. New York: WW Norton & Com-
pany.
111. Lohn, J.D., Linden, D.S., Hornby, G.S., Kraus, W.F., 2004. Evolutionary
design of an X-band antenna for NASA’s space technology 5 mission. In:
Proceedings of the IEEE Antennas and Propagation Society Symposium 2004, volume 3.
Monterey, CA, 20–25 June, pp. 2313–16. New York: IEEE.
112. Grimes, W., 2015. John Henry Holland, who computerized evolution, dies
at 86. New York Times, August 19.
113. Licklider, J.C.R., 1960. Man-computer symbiosis. IRE Transactions on Human
Factors in Electronics, 1(1), pp. 4–11.
114. Waldrop, M.M., 2001. The Dream Machine. London: Viking Penguin.
115. Kita, C.I., 2003. JCR Licklider’s vision for the IPTO. IEEE Annals of the History
of Computing, 25(3), pp. 62–77.
116. Licklider, J.C.R., 1963. Memorandum for members and affiliates of the
intergalactic computer network. Technical report, Advanced Research
Projects Agency, April 23.
117. Licklider, J.C.R., 1965. Libraries of the Future. Cambridge, MA: The MIT Press.
118. Licklider, J.C.R. and Taylor, R.W., 1968. The computer as a communication device. Science and Technology, 76(2), pp. 1–3.
119. Markoff, J., 1999. An Internet pioneer ponders the next revolution. The New
York Times, December 20.
120. Featherly, K., 2016. ARPANET. In Encyclopaedia Britannica [online]. https://www.britannica.com/topic/ARPANET. (Accessed 19 May 2020).
121. Leiner, B.M., Cerf, V.G., Clark, D.D., Kahn, R.E., Kleinrock, L., Lynch, D.C., Postel, J., Roberts, L.G., and Wolff, S., 2009. A brief history of the Internet. ACM SIGCOMM Computer Communication Review, 39(5), pp. 22–31.
122. Davies, D.W., 2001. An historical study of the beginnings of packet switch-
ing. The Computer Journal, 44(3), pp. 152–62.
123. Baran, P., 1964. On distributed communications networks. IEEE Transactions
on Communications Systems, 12(1), pp. 1–9.
124. McQuillan, J., Richer, I., and Rosen, E., 1980. The new routing algorithm
for the ARPANET. IEEE Transactions on Communications, 28(5), pp. 711–19.
125. McJones, P., 2008. Oral history of Robert (Bob) W. Taylor. Technical
report, Computer History Museum.
126. Metz, C., 2012. Bob Kahn, the bread truck, and the Internet’s first com-
munion. Wired, August 13.
127. Cerf, V. and Kahn, R., 1974. A protocol for packet network intercommunication. IEEE Transactions on Communications, 22(5), pp. 637–48.
128. Metz, C., 2007. How a bread truck invented the Internet. The Register [online]. https://www.theregister.co.uk/2007/11/12/thirtieth_anniversary_of_first_internet_connection/. (Accessed 19 May 2020).
129. Anonymous, 2019. Number of internet users worldwide from 2005 to 2018. Statista [online]. https://www.statista.com/statistics/273018/number-of-internet-users-worldwide/. (Accessed 19 May 2020).
130. Lee, J., 1998. Richard Wesley Hamming: 1915–1998. IEEE Annals of the History
of Computing, 20(2), pp. 60–2.
131. Suetonius, G., 2009. Lives of the Caesars. Oxford: Oxford University Press.
132. Singh, S., 1999. The Code Book: The Secret History of Codes & Code-breaking.
London: Fourth Estate.
133. Diffie, W. and Hellman, M., 1976. New directions in cryptography. IEEE
Transactions on Information Theory, 22(6), pp. 644–54.
134. Rivest, R.L., Shamir, A., Adleman, L., 1978. A method for obtaining digital
signatures and public-key cryptosystems. Communications of the ACM, 21(2),
pp. 120–6.
135. Gardner, M., 1977. New kind of cipher that would take millions of years
to break. Scientific American. 237(August), pp. 120–4.
136. Atkins, D., Graff, M., Lenstra, A.K., Leyland, P.C., 1994. The magic words are squeamish ossifrage. In: Proceedings of the 4th International Conference on the Theory and Applications of Cryptology. Wollongong, Australia, 28 November–1 December 1994. pp. 261–77. New York: Springer.
137. Levy, S., 1999. The open secret. Wired, 7(4).
138. Ellis, J.H., 1999. The history of non-secret encryption. Cryptologia, 23(3),
pp. 267–73.
139. Ellis, J.H., 1970. The possibility of non-secret encryption. In: British
Communications-Electronics Security Group (CESG) report. January.
140. Bush, V., 1945. As we may think. The Atlantic. 176(1), pp. 101–8.
141. Manufacturing Intellect, 2001. Jeff Bezos interview on starting Amazon. YouTube [online]. https://youtu.be/p7FgXSoqfnI. (Accessed 19 May 2020).
142. Stone, B., 2014. The Everything Store: Jeff Bezos and the Age of Amazon. New York:
Corgi.
143. Christian, B. and Griffiths, T., 2016. Algorithms to Live By. New York:
Macmillan.
144. Linden, G., Smith, B., and York, J., 2003. Amazon.com recommendations.
IEEE Internet Computing, 7(1), pp. 76–80.
145. McCullough, B., 2015. Early Amazon engineer and co-developer of the recommendation engine, Greg Linden. Internet History Podcast [online]. http://www.internethistorypodcast.com/2015/04/early-amazon-engineer-and-co-developer-of-the-recommendation-engine-greg-linden/#tabpanel6. (Accessed 15 February 2019).
146. MacKenzie, I., Meyer, C., and Noble, S., 2013. How retailers can keep up
with consumers. Technical report, McKinsey and Company, October.
147. Anonymous, 2016. Total number of websites. Internet Live Stats [online]. http://www.internetlivestats.com/total-number-of-websites/. (Accessed 15 February 2019).
148. Vise, D.A., 2005. The Google Story. New York: Macmillian.
149. Battelle, J., 2005. The birth of Google. Wired, 13(8), p. 102.
150. Page, L., Brin, S., Motwani, R., and Winograd, T., 1999. The PageRank
citation ranking: Bringing order to the web. Technical Report 1999–66,
Stanford InfoLab, November.
151. Brin, S. and Page, L., 1998. The anatomy of a large-scale hypertextual web
search engine. Computer Networks and ISDN Systems, 30(1–7), pp. 107–117.
152. Jones, D., 2018. How PageRank really works: Understanding Google. Majestic [blog]. https://blog.majestic.com/company/understanding-googles-algorithm-how-pagerank-works/, October 25. (Accessed 12 July 2019).
153. Willmott, D., 1999. The top 100 web sites. PC Magazine, February 9.
154. Krieger, L.M., 2005. Stanford earns $336 million off Google stock. The
Mercury News, December 1.
155. Hayes, A., 2019. Dotcom bubble definition. Investopedia [online]. https://www.investopedia.com/terms/d/dotcom-bubble.asp, June 25. (Accessed 19 July 2019).
156. Smith, B. and Linden, G., 2017. Two decades of recommender systems at
Amazon.com. IEEE Internet Computing, 21(3), pp. 12–18.
157. Debter, L., 2019. Amazon surpasses Walmart as the world’s largest retailer. Forbes [online]. https://www.forbes.com/sites/laurendebter/2019/05/15/worlds-largest-retailers-2019-amazon-walmart-alibaba/#20e4cf4d4171. (Accessed 18 July 2019).
158. Anonymous, 2019. Tim Berners-Lee net worth. The Richest [online]. https://www.therichest.com/celebnetworth/celebrity-business/tech-millionaire/tim-berners-lee-net-worth/. (Accessed 22 July 2019).
159. Anonymous, 2016. Internet growth statistics. Internet World Stats [online]. http://www.internetworldstats.com/emarketing.htm. (Accessed 19 May 2020).
160. Conan Doyle, A., 1890. The sign of four. Lippincott’s Monthly Magazine.
February.
161. Kirkpatrick, D., 2010. The Facebook Effect. New York: Simon and Schuster.
162. Grimland, G., 2009. Facebook founder’s roommate recounts creation of Internet giant. Haaretz [online]. https://www.haaretz.com/1.5050614. (Accessed 23 July 2019).
163. Kaplan, K.A., 2003. Facemash creator survives ad board. The Harvard Crimson,
November.
164. Investors Archive, 2017. Billionaire Mark Zuckerberg: Creating Facebook and startup advice. YouTube [online]. https://youtu.be/SSly3yJ8mKU.
165. Widman, J., 2011. Presenting EdgeRank: A guide to Facebook’s Newsfeed algorithm. http://edgerank.net. (Accessed 19 May 2020).
166. Anonymous, 2016. Number of monthly active Facebook users worldwide. Statista [online]. https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/. (Accessed 20 May 2020).
167. Keating, G., 2012. Netflixed. London: Penguin.
168. Netflix, 2016. Netflix prize. http://www.netflixprize.com/. (Accessed 19 May 2020).
169. Van Buskirk, E., 2009. How the Netflix prize was won. Wired, September 22.
170. Thompson, C., 2008. If you liked this, you’re sure to love that. The New York
Times. November 21.
171. Piotte, M. and Chabbert, M., 2009. The pragmatic theory solution to the
Netflix grand prize. Technical report, Netflix.
172. Koren, Y., 2009. The Bellkor solution to the Netflix grand prize. Netflix Prize
Documentation, 81:1–10.
173. Johnston, C., 2012. Netflix never used its $1 million algorithm due to
engineering costs. Wired, April 16.
174. Gomez-Uribe, C.A. and Hunt, N., 2015. The Netflix recommender system:
Algorithms, business value, and innovation. ACM Transactions on Management
Information Systems, 6(4), pp. 13:1–19.
175. Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinkski, M.S.,
and Brilliant, L., 2009. Detecting influenza epidemics using search engine
query data. Nature, 457(7232), pp. 1012–14.
176. Cook, S., Conrad, C., Fowlkes, A.L., Mohebbi, M.H., 2011. Assessing
Google flu trends performance in the United States during the 2009
influenza virus A (H1N1) pandemic. PLOS ONE, 6(8), e23610.
177. Butler, D., 2013. When Google got flu wrong. Nature, 494(7436), p. 155.
178. Lazer, D., Kennedy, R., King, G., Vespignani, A., 2014. The parable of
Google Flu: Traps in big data analysis. Science, 343(6176), pp. 1203–5.
179. Lazer, D. and Kennedy, R., 2015. What we can learn from the epic failure
of Google flu trends. Wired, October 1.
180. Zimmer, B., 2011. Is it time to welcome our new computer overlords? The
Atlantic, February 17.
181. Markoff, J., 2011. Computer wins on jeopardy: Trivial, it’s not. New York
Times, February 16.
182. Gondek, D.C., Lally, A., Kalyanpur, A., Murdock, J.W., Duboue, P.A., Zhang, L., Pan, Y., Qiu, Z.M., and Welty, C., 2012. A framework for merging and ranking of answers in DeepQA. IBM Journal of Research and Development, 56(3.4), pp. 14:1–12.
183. Best, J., 2013. IBM Watson: The inside story of how the Jeopardy-winning supercomputer was born, and what it wants to do next. TechRepublic [online]. http://www.techrepublic.com/article/ibm-watson-the-inside-story-of-how-the-jeopardy-winning-supercomputer-was-born-and-what-it-wants-to-do-next/. (Accessed 15 February 2019).
184. Ferrucci, D.A., 2012. Introduction to this is Watson. IBM Journal of Research
and Development, 56(3.4), pp. 1:1–15.
185. IBM Research, 2013. Watson and the Jeopardy! Challenge. YouTube [online]. https://www.youtube.com/watch?v=P18EdAKuC1U. (Accessed 15 September 2019).
186. Lieberman, H., 2011. Watson on Jeopardy, part 3. MIT Technology Review [online]. https://www.technologyreview.com/s/422763/watson-on-jeopardy-part-3/. (Accessed 15 September 2019).
187. Gustin, S., 2011. Behind IBM’s plan to beat humans at their own game.
Wired, February 14.
188. Lally, A., Prager, J.M., McCord, M.C., Boguraev, B.K., Patwardhan, S., Fan,
J., Fodor, P., and Chu-Carroll, J., 2012. Question analysis: How Watson reads a
clue. IBM Journal of Research and Development, 56(3.4), pp. 2:1–14.
189. Fan, J., Kalyanpur, A., Gondek, D.C., and Ferrucci, D.A., 2012. Automatic
knowledge extraction from documents. IBM Journal of Research and Develop-
ment, 56(3.4), pp. 5:1–10.
190. Kolodner, J.L., 1978. Memory organization for natural language data-base
inquiry. Technical report, Yale University.
191. Kolodner, J.L., 1983. Maintaining organization in a dynamic long-term
memory. Cognitive Science, 7(4), pp. 243–80.
192. Kolodner, J.L., 1983. Reconstructive memory: A computer model. Cogni-
tive Science, 7(4), pp. 281–328.
193. Lohr, S., 2016. The promise of artificial intelligence unfolds in small steps.
The New York Times, February 29. (Accessed 19 May 2020).
194. James, W., 1890. The Principles of Psychology. NY: Holt.
195. Hebb, D.O., 1949. The Organization of Behavior. NY: Wiley.
196. McCulloch, W.S. and Pitts, W., 1943. A logical calculus of the ideas
immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4),
pp. 115–33.
197. Gefter, A., 2015. The man who tried to redeem the world with logic.
Nautilus, February 21.
198. Whitehead, A.N. and Russell, B., 1910–1913. Principia Mathematica. Cam-
bridge: Cambridge University Press.
199. Anderson, J.A. and Rosenfeld, E., 2000. Talking Nets. Cambridge, MA: The
MIT Press.
200. Conway, F. and Siegelman, J., 2006. Dark Hero of the Information Age: In Search
of Norbert Wiener, the Father of Cybernetics. New York: Basic Books.
201. Thompson, C., 2005. Dark hero of the information age: The original
computer geek. The New York Times, March 20.
202. Farley, B.G. and Clark, W.A., 1954. Simulation of self-organizing systems
by digital computer. Transactions of the IRE Professional Group on Information
Theory, 4(4), pp. 76–84.
203. Rosenblatt, F., 1958. The Perceptron: A probabilistic model for informa-
tion storage and organization in the brain. Psychological Review, 65(6), p. 386.
204. Rosenblatt, F., 1961. Principles of neurodynamics: Perceptrons and the
theory of brain mechanisms. Technical report, DTIC Document.
205. Anonymous, 1958. New Navy device learns by doing. The New York Times,
July 8.
206. Minsky, M. and Papert, S., 1969. Perceptrons. Cambridge, MA: The MIT Press.
207. Minsky, M., 1952. A neural-analogue calculator based upon a probability
model of reinforcement. Technical report, Harvard University Psycholog-
ical Laboratories, Cambridge, Massachusetts.
208. Block, H.D., 1970. A review of Perceptrons: An introduction to computa-
tional geometry. Information and Control, 17(5), pp. 501–22.
209. Anonymous, 1971. Dr. Frank Rosenblatt dies at 43; taught neurobiology
at Cornell. The New York Times, July 13.
210. Olazaran, M., 1996. A sociological study of the official history of the
Perceptrons controversy. Social Studies of Science, 26(3), pp. 611–59.
211. Werbos, P.J., 1990. Backpropagation through time: What it does and how
to do it. Proceedings of the IEEE, 78(10), pp. 1550–60.
212. Werbos, P.J., 1974. Beyond regression: New tools for prediction and analysis in the
behavioral sciences. PhD. Harvard University.
213. Werbos, P.J., 1994. The Roots of Backpropagation, volume 1. Oxford: John Wiley
& Sons.
214. Werbos, P.J., 2006. Backwards differentiation in AD and neural nets: Past
links and new opportunities. In: H.M. Bücker, G. Corliss, P. Hovland,
U. Naumann, and B. Norris, eds., Automatic differentiation: Applications, theory,
and implementations. Berlin: Springer. pp. 15–34.
215. Parker, D.B., 1985. Learning-logic: Casting the cortex of the human brain
in silicon. Technical Report TR-47, MIT, Cambridge, MA.
216. LeCun, Y., 1985. Une procédure d'apprentissage pour réseau à seuil
asymétrique (A learning scheme for asymmetric threshold networks). In:
Proceedings of Cognitiva 85. Paris, France. 4–7 June 1985. pp. 599–604.
217. Rumelhart, D.E., Hinton, G.E., and Williams, R.J., 1986. Learning repre-
sentations by back-propagating errors. Nature, 323, pp. 533–36.
218. Hornik, K., Stinchcombe, M., and White, H., 1989. Multilayer feedforward
networks are universal approximators. Neural Networks, 2(5), pp. 359–66.
219. Ng, A., 2018. Heroes of Deep Learning: Andrew Ng interviews Yann
LeCun. YouTube [online]. https://www.youtube.com/watch?v=Svb1c6AkRzE.
(Accessed 14 August 2019).
220. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E.,
Hubbard, W., and Jackel, L.D., 1989. Backpropagation applied to handwrit-
ten zip code recognition. Neural Computation, 1(4), pp. 541–51.
221. Thorpe, S., Fize, D., and Marlot, C., 1996. Speed of processing in the human
visual system. Nature, 381(6582), pp. 520–2.
222. Gray, J., 2017. U of T Professor Geoffrey Hinton hailed as guru of new
computing era. The Globe and Mail, April 7.
223. Allen, K., 2015. How a Toronto professor’s research revolutionized artifi-
cial intelligence. The Star. April 17.
224. Hinton, G.E., Osindero, S., and Teh, Y.W., 2006. A fast learning algorithm
for deep belief nets. Neural Computation, 18(7), pp. 1527–54.
225. Ciresan, D.C., Meier, U., Gambardella, L.M., and Schmidhuber, J., 2010.
Deep big simple neural nets excel on handwritten digit recognition. arXiv
preprint arXiv:1003.0358.
226. Jaitly, N., Nguyen, P., Senior, A.W., and Vanhoucke, V., 2012. Application of
pretrained deep neural networks to large vocabulary speech recognition.
In: Proceedings of the 13th Annual Conference of the International Speech Communication
Association (Interspeech). Portland, Oregon, 9–13 September 2012. pp. 257–81.
227. Hinton, G., et al., 2012. Deep neural networks for acoustic modeling in
speech recognition: The shared views of four research groups. IEEE Signal
Processing Magazine, 29(6), pp. 82–97.
228. Krizhevsky, A., Sutskever, I., and Hinton, G.E., 2012. ImageNet classi-
fication with deep convolutional neural networks. In: C. Burges, ed.,
Proceedings of the 27th Annual Conference on Neural Information Processing Systems
2013. 5–10 December 2013, Lake Tahoe, NV. Red Hook, NY: Curran.
pp. 1097–1105.
229. Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C., 2003. A neural prob-
abilistic language model. Journal of Machine Learning Research, 3, pp. 1137–55.
230. Sutskever, I., Vinyals, O., and Le, Q.V., 2014. Sequence to sequence learning
with neural networks. In: Z. Ghahramani, M. Welling, C. Cortes, N.D.
Lawrence, and K.Q. Weinberger, eds., Proceedings of the 28th Annual Conference
on Neural Information Processing Systems 2014. 8–13 December 2014, Montreal,
Canada. Red Hook, NY: Curran. pp. 3104–12.
231. Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y., 2014. On the
properties of neural machine translation: Encoder-decoder approaches.
arXiv preprint arXiv:1409.1259.
232. Bahdanau, D., Cho, K., and Bengio, Y., 2014. Neural machine trans-
lation by jointly learning to align and translate. arXiv preprint arXiv:
1409.0473.
233. Wu, Y., et al., 2016. Google's neural machine translation system: Bridg-
ing the gap between human and machine translation. arXiv preprint
arXiv:1609.08144.
234. Lewis-Kraus, G., 2016. The great A.I. awakening. The New York Times,
December 20.
235. LeCun, Y., Bengio, Y., and Hinton, G., 2015. Deep learning. Nature,
521(7553), pp. 436–44.
236. Vincent, J., 2019. Turing Award 2018: Nobel prize of computing given to
'godfathers of AI'. The Verge [online]. https://www.theverge.com/2019/3/27/18280665/ai-godfathers-turing-award-2018-yoshua-bengio-geoffrey-hinton-yann-lecun.
(Accessed 19 May 2020).
237. Foster, R.W., 2009. The classic of Go. http://idp.bl.uk/. (Accessed 20 May
2020).
238. Moyer, C., 2016. How Google’s AlphaGo beat a Go world champion.
The Atlantic, March.
239. Morris, D.Z., 2016. Google’s Go computer beats top-ranked human. For-
tune, March 12.
240. AlphaGo, 2017 [film]. Directed by Greg Kohs. USA: Reel as Dirt.
241. Wood, G., 2016. In two moves, AlphaGo and Lee Sedol redefined the
future. Wired, March 16.
242. Metz, C., 2016. The sadness and beauty of watching Google’s AI play Go.
Wired, March 11.
243. Edwards, J., 2016. See the exact moment the world champion of Go
realises DeepMind is vastly superior. Business Insider [online].
https://www.businessinsider.com/video-lee-se-dol-reaction-to-move-37-and-w102-vs-alphago-2016-3?r=US&IR=T.
(Accessed 19 May 2020).
244. Silver, D., et al., 2016. Mastering the game of Go with deep neural net-
works and tree search. Nature, 529(7587), pp. 484–9.
245. Hern, A., 2016. AlphaGo: Its creator on the computer that learns by
thinking. The Guardian, March 15.
246. Burton-Hill, C., 2016. The superhero of artificial intelligence: can this
genius keep it in check? The Guardian, February 16.
247. Fahey, R., 2005. Elixir Studios to close following cancellation of key
project. gamesindustry.biz [online]. https://www.gamesindustry.biz/articles/elixir-studios-to-close-following-cancellation-of-key-project.
(Accessed 19 May 2020).
248. Mnih, V., et al., 2015. Human-level control through deep reinforcement
learning. Nature, 518(7540), p. 529.
249. Silver, D., et al., 2017. Mastering the game of Go without human knowl-
edge. Nature, 550(7676), pp. 354–9.
Index
Google 126, 127, 156–158, 168, 169, 197–198, 200, 204, 206, 227–228
  AdWords 157
  Cloud 30
  DeepMind Challenge 203
  Flu Trends 168
  Labs 196
  PageRank 152–157
  Translate 199–200
Government Codes and Ciphers School 52
Government Communications Headquarters (GCHQ) 140–141
Graph Colouring Problem 102
graphical user interface (GUI) 124, 144
greatest common divisor (GCD) 20–21
Greece 133
  Athens 22
  early mathematics 15, 18, 23
  language 30
greyscale 179
Halting Problem 49–51
  diagram 50
Hamilton, William 93
Hamming, Richard 129–130, 132–133
  photograph 129
  quote 130
Hammurabi, King 11, 13
  Code of 17
hardware 7
Harmon, Leon 61
harmonics 33–34, 36–37
Harrow School 76
Harvard Mark I 54
Harvard University 54, 64, 110, 118, 159, 189
hashing algorithm 219–220
Hassabis, Demis 205, 210–212
  photograph 210
  quote 211
Hastings, Reed 163, 168
Hebb, Donald 180
Hellman, Martin 134–136, 141
Helsgaun, Keld 97
Herbert, Simon 82–86
  photograph 83
Heron of Alexandria 16
Heron's algorithm 16
Hidden Markov Model (HMM) 197–198
Hindu–Arabic numerals 30–31
On the Hindu Art of Reckoning 30–31
Hinton, Geoffrey 192, 194–200
  photograph 195
Hoare, Tony 4
Holberton, Betty 60
Holland 103
  Amsterdam 103
Holland, John 110–115
  photograph 110
  quote 111
Holmes, Sherlock 148
Horn, Paul 172
Hot Or Not 159
House of Wisdom 30–31
Huang, Aja 203, 205, 206
Hughes, Eric
  quote 216
Hui, Fan 206, 212
  quote 206
Hui, Liu 29
human brain 180–181
Hungary
  Budapest 61
Hypatia 23
IBM 54, 73, 86, 110, 124, 125, 129, 135, 144, 170, 178, 183, 200, 227
  Jeopardy Challenge 171–178
  PC 144, 146
  Watson 171–178
    quote 173–174
  Watson Center 135, 171
image recognition 197–198
ImageNet 198
India 31, 45, 133
  language 30
Industrial Revolution 37, 39–40, 112
infinite loop 49
influenza 168–170
Insertion Sort 3–4, 6
  computational complexity 98
  diagram 3
Institute for Advanced Study (IAS) 61, 64, 68
  computer 67–68, 83
integrated circuit 72–73, 228
Intergalactic Computer Network 118