MACHINE LEARNING
An Applied Mathematics Introduction
First Edition

Paul Wilmott
www.wilmott.com
ISBN 978-1-9160816-0-4
Contents

Prologue

1 Introduction
1.1 The Topic At Hand
1.2 Learning Is Key
1.3 A Little Bit Of History
1.4 Key Methodologies Covered In This Book
1.5 Classical Mathematical Modelling
1.6 Machine Learning Is Different
1.7 Simplicity Leading To Complexity

2 General Matters
2.1 Jargon And Notation
2.2 Scaling
2.3 Measuring Distances
2.4 Curse Of Dimensionality
2.5 Principal Components Analysis
2.6 Maximum Likelihood Estimation
2.7 Confusion Matrix
2.8 Cost Functions
2.9 Gradient Descent
2.10 Training, Testing And Validation
2.11 Bias And Variance
2.12 Lagrange Multipliers
2.13 Multiple Classes
2.14 Information Theory And Entropy
2.15 Natural Language Processing
2.16 Bayes Theorem
2.17 What Follows

3 K Nearest Neighbours
3.1 Executive Summary
3.2 What Is It Used For?
3.3 How It Works
3.4 The Algorithm
3.5 Problems With KNN
3.6 Example: Heights and weights
3.7 Regression

4 K Means Clustering
4.1 Executive Summary
4.2 What Is It Used For?
4.3 What Does K Means Clustering Do?
4.4 Scree Plots
4.5 Example: Crime in England, 13 dimensions
4.6 Example: Volatility
4.7 Example: Interest rates and inflation
4.8 Example: Rates, inflation and GDP
4.9 A Few Comments

6 Regression Methods
6.1 Executive Summary
6.2 What Is It Used For?
6.3 Linear Regression In Many Dimensions
6.4 Logistic Regression
6.5 Example: Political speeches again
6.6 Other Regression Methods

Datasets
Epilogue
Index
Prologue
This book aims to teach you the most important mathematical foundations
of as many machine-learning techniques as possible. I also leave out what I
judge to be the unnecessary boring bits. It is also meant to be orthogonal
to most other books on and around this subject.
If you've already bought this book then you may want to skip the next
few paragraphs. In them I explain who the book is for. If you have bought
the book and it turns out you are the wrong audience then... Oops.
There are several different types of reader of machine-learning textbooks.
And different quantities of those.
Judging by what's available on Amazon I reckon that there are an awful
lot of people out there who just want to see the relevant lines of code. Often
in Python or R. There are plenty of books that will show you a surprisingly
compact coded-up representation of the solution of a problem. I'm not going
to worry about these people. You are well catered for already.
Then there are quite a lot of books that describe the uses of machine
learning techniques. But this time not much code, and very elementary
mathematics. These are great for getting an overview of the subject. But
they lack in detail. You can spot them by the dodgy typeface they use
for their mathematics. And that they are self published, although there is
absolutely nothing wrong with that at all.
And then there are the specialist books. Each devoted to a narrow field,
but extremely deep. I can't fault their typefaces. But they are a little bit
scary. And usually very, very expensive.
No, I'm not trying to compete with any of the above.
I am aiming for a small audience, not well catered for. An audience,
probably shrinking as I type. An audience of applied mathematicians. People
who like doing mathematics, solving problems, perhaps old hands at classical
techniques and keen to know more about this new data-driven revolution
that they are on the edges of. You know who you are. Yes, you at the back,
don't try sneaking out, sit down here at the front and let's learn this subject
together...
But first, I think we need to hear what The Machine has to say about
this book. (It is very insightful. I particularly liked the endorsement at the
end.)
These volumes are meant to be in multiple sections. You can read them
in this book at each of the four different places, from beginning to end. The
main point you want to make here is that both the book and the text on as
many of the techniques that you will use will be well-understood. But what
will be done to make them all accessible, and that they won’t be too hard to
understand? In fact, each of the techniques will be quite easy to understand,
and the material will be very well organized so that you can do what you
want in the most short amount of time (i.e., in less than 40 pages). A few
basic examples of different types of machine-learning can be found in the
book, but I'll focus on the main points of the book. If you're interested in
reading more about this subject, then see my book.
Acknowledgements
Professionally speaking...
Paul Wilmott studied mathematics at St Catherine's College, Oxford,
where he also received his D.Phil. He is the author of Paul Wilmott
Introduces Quantitative Finance (Wiley 2007), Paul Wilmott On Quantitative
Finance (Wiley 2006), Frequently Asked Questions in Quantitative Finance
(Wiley 2009), The Money Formula: Dodgy Finance, Pseudo Science, and
How Mathematicians Took Over the Markets (with David Orrell) (Wiley
2017) and other financial textbooks. He has written over 100 research
articles on finance and mathematics. Paul Wilmott was a founding partner
of the volatility arbitrage hedge fund Caissa Capital which managed $170
million. His responsibilities included forecasting, derivatives pricing, and risk
management.
Paul is the proprietor of www.wilmott.com, the popular quantitative
finance community website, and the quant magazine Wilmott. He is the
creator of the Certificate in Quantitative Finance, cqf.com, and the President
of the CQF Institute, cqfinstitute.org.
Chapter 1

Introduction
Having said all that, I am sure that some readers will take it upon themselves to find
as many errors in this book as they can. If you do find any then email me
at paul@wilmott.com in the first instance before reviewing me on Amazon.
I'm very approachable.)
This book does not include any code whatsoever. This is deliberate for
two reasons. Mainly it's because I am a useless programmer. But also the
upside is that I do have to teach the methods with enough detail that you
can implement them yourself. I cannot take any shortcuts such as giving
a no-math description of a technique followed by saying "Now write these
two lines of code, press enter, and look at the pretty pictures." There are
many programming packages you can use to implement the methods that
I describe and you don't need to know what they are doing to use them.
That's fine. It's like my feeble understanding of a carburettor doesn't stop me
enjoying my Jensen Interceptor Convertible. But equally if I did understand
the carburettor it would probably save me quite a bit of time and expense.
Finally, there are plenty of examples in this book, almost always using
real data. But this is not a research-level tome, and so the examples are for
illustration only. If it were a book describing the cutting edge of research
then there would be a heck of a lot more cleaning and checking of the data
and a lot more testing and validation of the models. But there'd also be
a lot less explanation of the basics, and perhaps less mentioning of where
things go wrong.
But now let's actually talk about the topic at hand.
Pierre de Fermat. Their problem was to figure out how to allocate a bet
in a game of dice (called Points) if the game were to be suspended part
way through, one side of the game doing better than the other. Thus
was invented/discovered the concept of mathematical expectations.
Take data and statistical techniques, throw them at a computer and what
you have is machine learning.
Except that there's more to it than that. And there's a clue in the
title of the subject, in the word "learning.” In traditional modelling you
would sit down with a piece of paper, a pencil and a single malt and. . .tell
the algorithm everything you know about, say, chess. You’d write down
some code like “IF White Queen something something Black Bishop THEN
something something." (I haven't played chess for forty years so apologies for
a lack of precision here.) And you’d have many, many lines of IF, AND and
OR, etc. code telling the programme what to do in complex situations, with
nuances in positions. Or .. . you'd write down some differential equation that
captures how you believe that inflation will respond to a change in interest
rates. As inflation rises so a central bank responds by increasing interest
rates. Increasing interest rates causes.. .
You might eyeball the data a bit, maybe do a small amount of basic
statistics, but mainly you'd just be using your brain.
Whatever the problem you would build the model yourself.
In machine learning your role in modelling is much more limited. You will
decide on a framework, whether it be a neural network or a support vector
machine, for example, and then the data does the rest. (It won't be long
now before the machine even does that for you.) The algorithm learns.
As I said, I'm not going to give you much history in this book. That
would be pointless, history is always being added to and being reassessed,
and this field is changing very rapidly. But I will tell you the two bits of
machine-learning history that I personally find interesting.
First let me introduce Donald Michie. Donald Michie had worked on
cyphers and code cracking with his colleague Alan Turing at Bletchley Park
during the Second World War. After the war he gained a doctorate and in the
early 1960s, as a professor in Edinburgh, turned his attention to the problem
of training — a word you will be hearing quite a lot of — a computer to play
the game of Noughts and Crosses, a.k.a. Tic Tac Toe. Well, not so much
a computer as an array of matchboxes, sans matches. Three hundred and
four matchboxes, laid out to represent the stages of a game of Noughts and
Crosses. I won't explain the rules of O&Xs, just like further down I won't be
explaining the rules of Go, but each stage is represented by a grid on which
one player has written an X alternating with the other player writing an O
with each player's goal being to get three of their symbols in a row. (Damn,
I did explain the rules. But really do not expect this for Go. Mainly because
I don't fully understand the game myself.)
Michie's scheme was to fill each of his 304 matchboxes with coloured beads, each colour representing
one of the empty cells in the grid. (He took advantage of symmetries or there
would have been even more than 304 matchboxes.) When it's our move we
take the matchbox representing the current state, shake it, and remove a
random bead. Its colour tells us where to place our X. This is followed by
our opponent's play. The game continues, and we move from state to state
and matchbox to matchbox. And if we win then each matchbox used in that
game gets rewarded with extra beads of its chosen colour, and if we lose then
a bead of that colour gets taken away. Eventually the matchboxes fill up with
beads representing the most successful action at each state. The machine,
and Michie called it MENACE, for Machine Educable Noughts And Crosses
Engine, has learned to play Noughts and Crosses. There’s some jargon in
the above: Training; State; Action; Reward; Learn. We'll be hearing more
of these later. And we shall return to Donald's matchboxes in the chapter
on reinforcement learning.
The next thing that I found interesting was the story of how Google
DeepMind programmed a computer to learn to play Go. And it promptly
went on to beat Lee Sedol, a 9-dan professional Go player. (No, I didn't
know you could become a 9th dan by sitting on your butt either. There's
hope for us all.) AlphaGo used a variety of techniques, among them neural
networks. Again it learned to play without explicit programming of optimal
play. Curiously not all of that play was against human opponents... it
also played against itself. To show how difficult Go is, and consequently
how difficult was the task for Google DeepMind, if you used matchboxes to
represent states a la Donald Michie then you’d need a lot more than 304
of them, you’d probably have to fill the known universe with matchboxes.
I would strongly recommend watching the creatively titled movie "The
AlphaGo Movie" to see a better description than the above and also to watch
the faces of everyone involved. Lee Sedol started moderately confident,
became somewhat gutted, and finally resigned (literally and psychologically).
Most of the programmers were whooping for joy as the games progressed.
Although there was one pensive-looking one who looked like he was thinking
"Oh.. . my.. . God.. . what.. . have.. . we.. . done?"
At times the computer made some strange moves, moves that no profes
sional Go player would have considered. In the movie there was discussion
about whether this was a programming error, lack of sufficient training, or a
stroke of computer genius. It was always the last of these. It has been said
that the game of Go has advanced because of what humans have learned
from the machine. (Did a chill just go down your spine? It did mine.)
In this book I am going to guide you through some of the most popular
machine-learning techniques, to the level at which you can start trying them
out for yourself, and there'll be an overview of some of these for the rest of
this chapter.
There are three main categories of machine learning. Two of these
categories concern how much help we give the machine. And the third is about
teaching a machine to do things.
Principal categories
Principal techniques
data points being in different classes. The naive bit refers to the
assumptions made, which are rarely true in practice but that doesn't
usually seem to matter.
What drives your system? Are there forces pushing and pulling some
thing? Are molecules bouncing off each other at random? Do buyers and
sellers move the price of milk up and down?
If you are lucky you'll have some decent principles on which to base your model.
(There's also the category of simply feeling right, such as the inverse
square law of gravity.)
But maybe you have to fall back on some common sense or observation.
To model the dynamical system of lions and gazelles (the former like to
eat the latter) you would say that the more lions there are the more
they breed, ditto for gazelles. And the more gazelles there are the more
food for lions, which is great for the lions but not so much for the gazelles.
From this it is possible to write down a nice differential-equation model.
The 'cello-string model will give accurate behaviour for the string, but while
the lion-gazelle model is usefully explanatory of certain effects it will not
necessarily be quantitatively accurate. The latter is a toy model.
Your model will probably need you to give it values for some parameters.
Again if you are lucky you might be able to get these from experiment; the
acceleration due to gravity, or the viscosity of a fluid. And perhaps those
parameters will stay constant. But maybe the parameters will be difficult to
measure, unstable, or perhaps not even exist, the volatility of a stock price
for example.
These are issues that the mathematical modeller needs to address and
will make the difference between a model with reliable outputs and one that
is merely useful for explanation of phenomena.
There are many branches of mathematics, some are more easily employed
than others when it comes to practical problems.
Discrete mathematics, or continuous mathematics? Will you be using
calculus, perhaps ordinary or partial differential equations? Or perhaps the
model is going to use probabilistic concepts. If you are trying to figure out
how to win at Blackjack then clearly you'll be working with discrete values
for the cards and probability will play a major role. As will optimization.
Change to poker and game theory becomes important.
Mathematicians do suffer from the classic “If all you have is a hammer
every problem becomes a nail." This can affect the evolution of problem
solutions, as the problem becomes known as, say, a problem in subject X,
when subject Y might also be useful.
Setting out a problem and a model is nice. But you'll almost always want
to find its solution. If lucky, again, you'll get a formula using a pencil and
paper. More often than not, especially if it's a real-world, often messy, problem
then you'll have to do some number crunching. There are many numerical
methods, many for each branch of mathematical modelling. Numerical
solution is different from modelling, don't confuse the two. Numerical analysis
is a subject in its own right.
In machine learning the role of the type of mathematics is taken by the choice of machine-learning
scheme. Is the problem best approached via supervised or unsupervised
learning? Would it be a self-organizing map or K means clustering
that gave the most reliable results?
Very often the building blocks of a machine-learning algorithm are very
simple. A neural network might involve matrix multiplications and a few
simple non-linear functions. Simple models can lead to complex behaviour.
But is it the right simple model and a quantitatively useful behaviour?
Cellular automata

• Any live cell with fewer than two live neighbours dies, as if caused by underpopulation.

• Any live cell with two or three live neighbours lives on to the next generation.

• Any live cell with more than three live neighbours dies, as if by overpopulation.

• Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.
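The book itself contains no code, but those four rules are simple enough that a minimal sketch is worth seeing. The snippet below is my own illustration, not the author's: it assumes a small NumPy grid of 0s and 1s, with cells beyond the edge treated as dead, and applies one generation of the rules.

```python
import numpy as np

def life_step(grid):
    """One generation of Conway's Game of Life on a 2-D array of 0s and 1s."""
    rows, cols = grid.shape
    # Count the live neighbours of every cell (cells outside the edge are dead).
    padded = np.pad(grid, 1)
    neighbours = sum(
        padded[i:i + rows, j:j + cols]
        for i in range(3) for j in range(3)
        if not (i == 1 and j == 1)
    )
    survive = (grid == 1) & ((neighbours == 2) | (neighbours == 3))
    born = (grid == 0) & (neighbours == 3)
    return (survive | born).astype(int)

# A glider on a 10x10 grid, stepped forward four generations.
grid = np.zeros((10, 10), dtype=int)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1
for _ in range(4):
    grid = life_step(grid)
print(grid)
```

Even this handful of lines produces gliders, oscillators and all the other complex behaviour the text is about to mention.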
https://www.youtube.com/watch?v=iiEQg-SHYlg.
I worked on a cellular automaton model in the early 1990s, a model
representing the separation one finds in granular materials when the grains are of
different sizes. This model was created because of the lack of progress we
had made in trying to develop a more classical continuum model.
Fractals
$$f_1\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0.00 & 0.00 \\ 0.00 & 0.16 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}$$

[Figure: the fractal produced by iterating affine maps of this kind, plotted for x between -3 and 3.]
Chaos
[Figure: a time series generated by a simple chaotic map, values between 0 and 1 plotted over 100 steps.]
Those last few pages were rather tangential to our topic. But let me
reassure you that you have not been charged for them.
I mention all of the above as both an encouragement and as a warning to
anyone embarking on machine learning. On a positive note we see complex
behaviour arising out of the simplest of models. And some of the machine
learning you will be seeing can be very simple indeed. The building blocks
of basic neural networks are just matrix multiplication and a fairly bland
activation function.
On the other hand once we have represented something quite complicated,
such as millions of data points, with something simple, how do we
ever know that it's the right simple model?
I recall a great deal of excitement in my mathematics department in the
late 1980s because a group had discovered they could use dynamical systems
(chaos) to model the foreign exchange markets. As I recall the excitement
soon faded once they learned more about how the currency markets actually
worked, and that their forecasts were essentially useless. (I could tell by
looking at their shoes that they hadn't struck it rich.)
Further Reading
Chapter 2

General Matters
There are some issues that are common to many branches of machine
learning. In this chapter I put these all together for some brief discussion. More
often than not in later chapters I will simply refer you to this chapter for
further consideration.
Every technical subject has its own jargon, and that includes machine
learning. Important jargon to get us started is as follows. (Every chapter
will introduce more.)
One small complaint here: Some rather simple things I've grown up with
in applied mathematics seem to be given fancy names when encountered in
machine learning. I'll say no more.
For example, you might want to group cats into different breeds, and
the features might be length of fur (numerical) or whether or not it
has a tail (descriptive but easily represented by value zero or one). (No
tail? If you are wondering, the Manx cat is tailless.) Each data point
will often be represented by a feature vector, each entry in the vector
representing a feature.
Vectors are bold face, and are column vectors. The vector of features
of the $n^{th}$ individual will be $\mathbf{x}^{(n)}$.
2.2 Scaling
Many of the methods we shall see measure distances between data points, so we must often make sure that the scales for each
feature are roughly the same.
Suppose we were characterising people according to their salary and number
of pets. If we didn't scale then the salary numbers, being in the thousands,
would outweigh the number of pets. Then the number of pets a
person had would become useless, which would be silly if one was trying to
predict expenditure on dog food.
All we have to do is adjust entries for each feature by adding a number
and multiplying by a number so as to make sure that the scales match.
This is easy to do. And there are a couple of obvious ways (a second
is in the parentheses below). With the original unscaled data take one of
the features, say the top entry in the feature vector, and measure the mean and
standard deviation (or minimum and maximum) across all of the data. Then
translate and rescale the first entries in all of the feature vectors to have a
mean of zero and a standard deviation of one (or a minimum of zero and a
maximum of one). Then do the same for the second entry, and so on.
Just to be clear we use the same translation and scaling for the same
feature across all data points. But it will be a different translation and scaling
for other features.
This scaling and translation is a common first step in many machine
learning algorithms. I don’t always mention the rescaling at the start of each
chapter so please take it as read.
And don’t forget that when you have a new sample for prediction you
must scale all its features first, using the in-sample scaling.
One small warning though, if you have outliers then they can distort
the rest of the data. For example one large value for one feature in one
sample can, after rescaling, cause the values for that feature for the rest of
the samples to be tiny and thus appear to be irrelevant when you measure
distances between feature vectors. How important this effect is will depend
on what type of rescaling you use. Common sense will help determine if and
how you should rescale.
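As a concrete illustration (mine, not the book's) of the standardisation described above, here it is done column by column with NumPy. The salary and pet numbers are made up, and note that a new sample is scaled with the in-sample mean and standard deviation, not its own.

```python
import numpy as np

# Each row is a data point, each column a feature (salary, number of pets).
X = np.array([[55000.0, 2],
              [72000.0, 0],
              [43000.0, 5],
              [61000.0, 1]])

mu = X.mean(axis=0)       # per-feature mean
sd = X.std(axis=0)        # per-feature standard deviation
X_scaled = (X - mu) / sd  # mean zero, standard deviation one, feature by feature

# A new sample must use the *in-sample* mu and sd.
x_new = np.array([50000.0, 3])
x_new_scaled = (x_new - mu) / sd
print(X_scaled, x_new_scaled)
```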
As I just said, we'll often be working with vectors and we will want to
measure the distances between them. The shorter the distance between two
feature vectors the closer in character are the two samples they represent.
There are many ways to measure distance.
This would be the default measure, as the crow flies. It’s the norm.
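The book's formulas for the various distance measures are lost from this extract, so the following is only a reminder in code (my sketch, not the author's text) of two common choices: the Euclidean distance just described, and the Manhattan distance mentioned again in the K means chapter.

```python
import numpy as np

def euclidean(a, b):
    # As-the-crow-flies distance: square root of the sum of squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block distance: sum of absolute differences.
    return np.sum(np.abs(a - b))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclidean(a, b), manhattan(a, b))  # 5.0 and 7.0
```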
There are only seven billion people on the planet but certain companies
are trying to collect so much data on each of them/us that you can imagine
the curse of dimensionality becoming an issue. The solution? Don't collect
so much data, Mark.
In practice things are not usually as bad as all that, this is very much a
worst-case scenario. It is certainly possible that there is some relationship
between features so that your 13-dimensional, say, problem might only be
effectively in six dimensions. Sometimes you might want to consider doing
some preparatory analysis on your features to see if the dimension can be
reduced. Which leads us onto . . .
You arrive at the train station in a city you've never been to before. You
go to the taxi rank so as to get to your final destination. There is one taxi,
you take it. While discussing European politics with the taxi driver you notice
that the cab’s number is 1234. How many taxis are in that city?
To answer this we need some assumptions. Taxi numbers are positive
integers, starting at 1, no gaps and no repeats. We'll need to assume that we
are equally likely to get into any cab. And then we introduce the parameter
N as the number of taxis. What is the MLE for N?
Well, what is the probability of getting into taxi number 1234 when there
are N taxis? It is $1/N$, provided $N \geq 1234$. This is largest for the smallest
permissible value, and so the maximum likelihood estimate is $N = 1234$.
Suppose you toss a coin n times and get h heads. What is the probability,
p, of tossing a head next time?
The probability of getting h heads from n tosses is, assuming that the
tosses are independent,

$$\frac{n!}{h!\,(n-h)!}\, p^h (1-p)^{n-h}.$$

Applying MLE is the same as maximizing this expression with respect to p.
This likelihood function (without the coefficient in the front that is independent
of p) is shown in Figure 2.1 for n = 100 and h = 55. There is a very
obvious maximum.
[Figure 2.1: The likelihood function (scaled), plotted against p, for n = 100 and h = 55.]
Often with MLE when multiplying probabilities, as here, you will take
the logarithm of the likelihood and maximize that. This doesn’t change the
maximizing value but it does stop you from having to multiply many small
numbers, which is going to be problematic with finite precision. (Look at
the scale of the numbers on the vertical axis in the figure.) Since the first
part of this expression is independent of p we maximize

$$h \ln(p) + (n-h)\ln(1-p).$$

[Figure 2.2: The logarithm of the likelihood function plotted against p.]
This just means differentiating with respect to p and setting the derivative
equal to zero. This results in

$$p = \frac{h}{n}.$$
For the final example let's say we have a hat full of random numbers
drawn from a normal distribution but with unknown mean and standard
deviation, that's two parameters. The probability of drawing a number x
from this hat is

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

where $\mu$ is the mean and $\sigma$ the standard deviation, which are both to be
estimated. The log likelihood is then the logarithm of this, i.e.

$$\ln(p(x)) = -\tfrac{1}{2}\ln(2\pi) - \ln\sigma - \frac{(x-\mu)^2}{2\sigma^2}.$$

If draws are independent then after N draws, $x_n$, the likelihood is just

$$p(x_1)p(x_2)\ldots p(x_N) = \prod_{n=1}^{N} p(x_n).$$

Taking logarithms,

$$\ln\left(\prod_{n=1}^{N} p(x_n)\right) = \sum_{n=1}^{N}\ln(p(x_n)),$$

which is, up to a constant,

$$-N\ln\sigma - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2 \qquad (2.2)$$

for the normal distribution. (I've dropped a constant at the front because it
doesn't affect the position of the maximum.)
An example is plotted in Figure 2.3. For this plot I generated ten random
xs from a normal distribution with mean of 0.1 and standard deviation of
0.2. These ten numbers go into the summation in expression (2.2), and I've
plotted that expression against $\mu$ and $\sigma$. The maximum should be around
$\mu = 0.1$ and $\sigma = 0.2$. (It will be at exactly whatever is the actual mean
and standard deviation of the ten random numbers I generated, as I'll show
next.)
[Figure 2.3: The log likelihood as a function of $\mu$ and $\sigma$.]
To find the maximum likelihood estimate for $\mu$ you just differentiate with
respect to $\mu$ and set this to zero. This gives

$$\mu = \frac{1}{N}\sum_{n=1}^{N} x_n.$$

I.e. $\mu$ is the simple average. And differentiating with respect to $\sigma$ gives

$$\sigma = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(x_n - \mu\right)^2}. \qquad (2.3)$$
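A quick numerical check of those last two results (my sketch, not the book's): draw ten numbers as in Figure 2.3, evaluate expression (2.2) on a grid of $\mu$ and $\sigma$, and compare where the maximum sits with the closed-form estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.1, 0.2, size=10)   # ten draws, true mean 0.1, true sd 0.2

def log_likelihood(mu, sigma):
    # Expression (2.2): -N ln(sigma) - sum (x_n - mu)^2 / (2 sigma^2)
    return -len(x) * np.log(sigma) - np.sum((x - mu) ** 2) / (2 * sigma ** 2)

# Closed-form maximum likelihood estimates.
mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))

# Brute-force search over a grid, in the spirit of Figure 2.3.
mus = np.linspace(-0.5, 0.5, 201)
sigmas = np.linspace(0.05, 0.6, 201)
best = max((log_likelihood(m, s), m, s) for m in mus for s in sigmas)

print("closed form:", mu_hat, sigma_hat)
print("grid search:", best[1], best[2])
```

The two answers agree to within the grid spacing, as the differentiation above says they must.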
[Figure 2.4: The confusion matrix for the fruit example. Columns are the actual class, rows the predicted class.]

                         ACTUAL CLASS
                         Apple        Non apple
PREDICTED   Apple        TP = 11      FP = 9
CLASS       Non apple    FN = 5       TN = 75
Sixteen of the 100 are apples. The rest are a mix of pears, oranges,
bananas etc. Our algorithm labels each item as whatever fruit it thinks it is.
It thinks there are 20 apples. Unfortunately the algorithm both identified
as apples some fruit that weren't apples, and missed some that were. Only 11 of
the apples did it correctly identify. So it had some false positives and some
false negatives. This is shown in Figure 2.4.
In order to interpret these numbers they are often converted into several
rates:
• Accuracy rate: (TP + TN)/Total, where Total = TP + TN + FP + FN. This
measures the fraction of times the classifier is correct.

• Error rate: 1 - (TP + TN)/Total.

• True positive rate or Recall: TP/(TP + FN).

• False positive rate: FP/(TN + FP).

• True negative rate or Specificity: 1 - FP/(TN + FP).

• Precision: TP/(TP + FP). If the algorithm predicts apple how often is it
correct?
Considering that there are only four numbers here (three if you scale to
have 100, say, as the total) there are an awful lot of ways they are sliced and
diced to give different interpretative measures. On the Wikipedia page for
Confusion Matrix I counted 13 different measures. Here's a good one, the
Matthews correlation coefficient, defined by

$$\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$
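The numbers in Figure 2.4 make a quick worked example. The snippet below is my own illustration, not the book's; it just evaluates the rates listed above and the Matthews correlation coefficient.

```python
TP, FP, FN, TN = 11, 9, 5, 75
total = TP + FP + FN + TN

accuracy    = (TP + TN) / total        # fraction of correct classifications
error_rate  = 1 - accuracy
recall      = TP / (TP + FN)           # true positive rate
fpr         = FP / (TN + FP)           # false positive rate
specificity = 1 - fpr                  # true negative rate
precision   = TP / (TP + FP)           # how often a predicted apple is an apple

mcc = (TP * TN - FP * FN) / ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5

print(accuracy, recall, precision, mcc)
```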
The area under the ROC curve (the AUC) is then a measure of how
good different algorithms are. The closer to one (the maximum possible)
the better. If you enter Kaggle competitions they often use the AUC to rank
competitors. (If you do enter Kaggle competitions after reading this book,
and I hope you will, don’t forget my 10% cut of winnings. You did read the
small print didn't you?)
Let's look at an example. Suppose we have data for age and salary of
a sample of lawyers then we might want to look for a linear relationship
between the two. Old lawyers earn more perhaps. We might represent the
$n^{th}$ lawyer's age by $x^{(n)}$ and their salary by $y^{(n)}$. And we'll have N of them.
See Figure 2.6, although the numbers here are made up. Note that I haven't
bothered to scale the features, the xs. That's because I'm not measuring
any distances between feature vectors here.
[Figure 2.6: The made-up lawyer data, salary (vertical axis, 0 to 250) against age (horizontal axis, 20 to 70).]
$$y = \theta_0 + \theta_1 x, \qquad (2.4)$$

where the $\theta$s are the parameters that we want to find to give us the best
fit to the data. (Shortly I'll use $\boldsymbol{\theta}$ to represent the vector with entries being
these $\theta$s.) Call this linear function $h_\theta(x)$, to emphasise the dependence on
both the variable x and the two parameters, $\theta_0$ and $\theta_1$.
We want to measure how far away the data, the $y^{(n)}$s, are from the
function $h_\theta(x)$. A common way to do this is via the quadratic cost function

$$J(\theta) = \frac{1}{2N}\sum_{n=1}^{N}\left(h_\theta(x^{(n)}) - y^{(n)}\right)^2. \qquad (2.5)$$
This is just the sum of the squares of the vertical distances between the points
and the straight line. The constant in front, the $\frac{1}{2N}$, makes no difference
to the minimization but is there by convention, and can also be important
when N is large, to scale quantities.
We want the parameters that minimize (2.5). This is called ordinary least
squares (OLS).
This is a popular choice for the cost function because obviously it will be
zero for a perfect, straight line, fit, but it also has a single minimum. This
makes finding that minimum straightforward: differentiate (2.5) with respect
to $\theta_0$ and $\theta_1$ and set the derivatives to zero,

$$\sum_{n=1}^{N}\left(\theta_0 + \theta_1 x^{(n)} - y^{(n)}\right) = 0 \quad\text{and}\quad \sum_{n=1}^{N} x^{(n)}\left(\theta_0 + \theta_1 x^{(n)} - y^{(n)}\right) = 0.$$

These give

$$\theta_0 = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{N\sum x^2 - \left(\sum x\right)^2} \quad\text{and}\quad \theta_1 = \frac{N\sum xy - \left(\sum x\right)\left(\sum y\right)}{N\sum x^2 - \left(\sum x\right)^2}.$$

I have dropped all the (n) superscripts in these to make them easier to read,
it's obvious what is meant.
The cost function as a function of $\theta_0$ and $\theta_1$ is shown in Figure 2.7
using the same data as in Figure 2.6. Actually it's the logarithm of the cost
function. This makes it easier to see the minimum. Without taking logs the
wings in this plot would dominate the picture. And the fitted line is shown
in Figure 2.8.
[Figure 2.7: The logarithm of the cost function as a function of the two parameters.]
[Figure 2.8: The data and the fitted straight line, y = 3.3667x - 5.2956, with R² = 0.9723.]
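The book's lawyer data isn't reproduced in this extract, so the sketch below (mine, not the author's) applies the explicit formulas for $\theta_0$ and $\theta_1$ above to made-up age/salary numbers, and checks the answer against numpy.polyfit.

```python
import numpy as np

# Made-up ages and salaries, roughly in the spirit of Figure 2.6.
x = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)
y = np.array([70, 95, 110, 125, 150, 160, 180, 195, 215], dtype=float)
N = len(x)

# Closed-form ordinary least squares for y = theta0 + theta1 * x.
denom = N * np.sum(x ** 2) - np.sum(x) ** 2
theta0 = (np.sum(y) * np.sum(x ** 2) - np.sum(x) * np.sum(x * y)) / denom
theta1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom

print(theta0, theta1)
print(np.polyfit(x, y, 1))  # returns [theta1, theta0]; should agree
```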
Many machine learning methods fit models by looking for model param
eters that minimize some cost function.
Later we'll see cost functions in higher dimensions, i.e. more explanatory
features (as well as the lawyer’s age we could also include the annual fee for
the school he went to, etc.).
In some problems minimizing a cost function and maximizing a likelihood
can lead to the same or similar mathematical optimization problems. Often
this is when some aspect of the model is Gaussian. And, of course, applying
MLE really requires there to be a probabilistic element to your problem,
whereas the cost-function approach does not.
Other cost functions are possible, depending on the type of problem you
have. For example if we have a classification problem — "This is an apple"
instead of "0.365" — then we won’t be fitting with a linear function. If it's a
binary classification problem then apples would be labelled 1, and non apples
would be 0, say. And every unseen fruit that we want to classify is going to
be a number between zero and one. The closer to one it is the more likely
it is to be an apple.
In such a classification problem we would want to fit a function that
ranged from zero to one, to match the values for the class labels. And
linear, as we'll see later, would be a silly choice. Typically we fit an S-shaped
sigmoidal function. Probably the most common example would be

$$h_\theta(x) = \frac{1}{1 + e^{-\theta_0 - \theta_1 x}}.$$
The quadratic cost function is just not suitable in this situation. Instead
we often see the following.
$$J(\theta) = -\frac{1}{N}\sum_{n=1}^{N}\left[\, y^{(n)}\ln\!\left(h_\theta(x^{(n)})\right) + \left(1 - y^{(n)}\right)\ln\!\left(1 - h_\theta(x^{(n)})\right)\right].$$
This looks a bit strange at first, it's not immediately clear where this
has come from. But if you compare it with (2.1) then you’ll start to see
there might be a link between minimizing this cost function and maximizing
the likelihood of something (there's a sign difference between the two). The
analogy is simply that the ys represent the class of the original data (the
head/tail or apple/non apple) and the hg is like a probability (for y being 1
or 0). More anon.
This cost function is also called the cross entropy between the two proba
bility distributions, one distribution being the ys, the empirical probabilities,
and the other being he, the fitted function. We'll see more about the cross
entropy in a couple of sections' time.
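As a small illustration (not from the book) of the sigmoid fit and the cross-entropy cost above, with made-up one-dimensional data labelled 0 or 1:

```python
import numpy as np

def h(theta0, theta1, x):
    # The sigmoidal fitting function 1 / (1 + exp(-theta0 - theta1 x)).
    return 1.0 / (1.0 + np.exp(-theta0 - theta1 * x))

def cross_entropy_cost(theta0, theta1, x, y):
    p = h(theta0, theta1, x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up data: label 1 tends to go with larger x.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(cross_entropy_cost(0.0, 1.0, x, y))   # a reasonable fit, small cost
print(cross_entropy_cost(0.0, -1.0, x, y))  # a bad fit, much larger cost
```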
Regularization
$$J(\theta) = \frac{1}{2}\sum_{n=1}^{N}\left(h_\theta(x^{(n)}) - y^{(n)}\right)^2 + \frac{\lambda}{2}\,\theta_1^2.$$

The addition of the second, regularization, term on the right encourages the
optimization to find simpler models. Note that it doesn't include the $\theta_0$
parameter. The larger the $\theta_1$ the greater the change to the forecast for a
small change in the independent variable. So we might want to reduce it.
It's not easy to appreciate the importance of the regularization term
with a simple one-dimensional linear regression. Instead think of fitting a
polynomial to some data. If you fit with a high-order polynomial you will get
a lot of rapid wiggling of the fitted line. Is that real, or are you overfitting?
The regularization term reduces the size of all the coefficients, except for the
first one representing a constant level, thus reducing such possibly irrelevant
wiggles.
Back to the linear regression in one dimension, the minimum is now at

$$\theta_0 = \frac{\left(\sum y\right)\left(\lambda + \sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{N\left(\lambda + \sum x^2\right) - \left(\sum x\right)^2}$$

and

$$\theta_1 = \frac{N\sum xy - \left(\sum x\right)\left(\sum y\right)}{N\left(\lambda + \sum x^2\right) - \left(\sum x\right)^2}.$$
A little warning, just because I've just given a few explicit solutions for
where the minimum is don't expect this to happen much in this book. More
often than not we'll be minimizing cost functions etc. numerically.
The idea of regularization is easily generalized to higher dimensions.
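A quick sketch (mine, not the book's) of the effect of the regularization term, using the explicit one-dimensional formulas above: as $\lambda$ grows, the slope $\theta_1$ is pulled towards zero.

```python
import numpy as np

x = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)
y = np.array([70, 95, 110, 125, 150, 160, 180, 195, 215], dtype=float)
N = len(x)

def regularized_fit(lam):
    denom = N * (lam + np.sum(x ** 2)) - np.sum(x) ** 2
    theta0 = (np.sum(y) * (lam + np.sum(x ** 2)) - np.sum(x) * np.sum(x * y)) / denom
    theta1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom
    return theta0, theta1

for lam in [0.0, 10.0, 1000.0, 100000.0]:
    print(lam, regularized_fit(lam))  # theta1 shrinks as lambda increases
```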
So you want to find the parameters that minimize a loss function. And
almost always you are going to have to do this numerically. We'll even
see an example later where there is technically an explicit solution for the
minimizing parameters but since it involves matrix inversion you might as
well go straight into a numerical solution anyway.
[Figure 2.9: Gradient descent; successive parameter updates converge towards the minimum of the loss function.]
The loss function J is a function of all of the data points, i.e. their actual
values and the fitted values. In the above description of gradient descent we
have used all of the data points simultaneously. This is called batch gradient
descent. But rather than use all of the data in the parameter updating we
can use a technique called stochastic gradient descent. This is like batch
gradient descent except that you only update using one of the data points
each time. And that data point is chosen randomly. Hence the stochastic.
Let me write the cost function as

$$J(\theta) = \sum_{n=1}^{N} J_n(\theta),$$

where $J_n(\theta)$ is the contribution to the cost from the $n^{th}$ data point.
• Since the data points used for updating are chosen at random the
convergence will not be as smooth as batch gradient descent. But
surprisingly this can be a good thing. If your loss function has local
minima then the bouncing around can bounce you past a local
minimum and help you converge to the global minimum.
And then there's mini batch gradient descent in which you use subsets of
the full dataset, bigger than 1 and smaller than N, again chosen randomly.
Finally if you are using a stochastic version of gradient descent you could
try smoothing out the bouncing around, if it's not being helpful, by taking
an average of the new update and past updates, leading to an exponentially
weighted updating. This is like having momentum in your updating and can
speed up convergence. You’ll need another parameter to control the amount
of momentum.
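The figure and some of the surrounding algebra are missing from this extract, so the sketch below is entirely my own: batch and stochastic gradient descent applied to the quadratic cost (2.5) for the straight-line fit, with an assumed learning rate.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, 200)   # data from a known straight line
N = len(x)

def batch_gd(lr=0.1, epochs=2000):
    t0, t1 = 0.0, 0.0
    for _ in range(epochs):
        err = t0 + t1 * x - y                 # uses all data points at once
        t0 -= lr * err.mean()                 # dJ/dtheta0
        t1 -= lr * (err * x).mean()           # dJ/dtheta1
    return t0, t1

def stochastic_gd(lr=0.1, epochs=50):
    t0, t1 = 0.0, 0.0
    for _ in range(epochs):
        for n in rng.permutation(N):          # one randomly chosen point at a time
            err = t0 + t1 * x[n] - y[n]
            t0 -= lr * err
            t1 -= lr * err * x[n]
    return t0, t1

print(batch_gd())       # roughly (2, 3)
print(stochastic_gd())  # also roughly (2, 3), but noisier
```

The stochastic version bounces around the minimum rather than settling exactly on it, which is the behaviour described in the bullet point above.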
Epochs
In some machine-learning methods one uses the same training data many
times, as the algorithm gradually converges, for example, in stochastic gradient
descent. Each time the whole training set of data is used in the training
that is called an epoch or iteration. Typically you won't get decent results
until convergence after many epochs.
One sees a decreasing error as the number of epochs increases, as shown
in Figure 2.10. But that does not mean that your algorithm is getting better,
it could easily mean that you are overfitting.
[Figure 2.10: The error decreasing as the number of epochs increases.]
This can happen if the algorithm has seen the training data too many
times, i.e. there have been too many epochs. To test for this you introduce
the test set of data, the data that you have held back. All being well you wi
get results like in Figure 2.11. The test set is not as good as the training set,
obviously, but both are heading in the right direction. You can stop training
when the errors have levelled out consistently. There is a caveat to this, if
the test error is much bigger than the training error then that could also be
a sign of overfitting.
[Figure 2.11: Training and test-set errors against epochs. Training and testing are looking good. Stop training when the error levels out.]
On the other hand if you get results like in Figure 2.12 where the test-set
error begins to rise again then you have overfitted.
[Figure 2.12: The test-set error rising again while the training error keeps falling. Looks like we have overfitting.]
$$y = f(x) + \epsilon.$$

Here $\epsilon$ is an error term; $\epsilon$ has mean of zero (if it wasn't zero it could be
absorbed into $f(x)$) and variance $\sigma^2$. The error term, whose variance could
also be x dependent, captures either genuine randomness in the data or noise
due to measurement error.
And suppose we find, using one of our machine-learning techniques, a
deterministic model for this relationship:

$$y = \hat{f}(x).$$
Now $f$ and $\hat{f}$ won't be the same. The function our algorithm finds, $\hat{f}$, is
going to be limited by the type of algorithm we use. And it will have been
fitted using a lot of real training data. That fitting will probably be trying
to achieve something like minimizing

$$\sum_{n=1}^{N}\left(\hat{f}(x^{(n)}) - y^{(n)}\right)^2.$$
The more complex, perhaps in terms of parameters, the model then the
smaller this error will be. Smaller, that is, for the training set.
Now along comes a new data point, x', not in the training set, and we
want to predict the corresponding y'. We can easily see how much is the
error in our prediction.
The error we will observe in our model at point x' is going to be

$$\hat{f}(x') - y'. \qquad (2.6)$$

There is an important subtlety here. Expression (2.6) looks like it only has
the error due to the e. But it's also hiding another, possibly more important
and definitely more interesting, error, and that is the error due to what is
in our training data. A robust model would give us the same prediction
whatever data we used for training our model.
So let's look at the average error, the mean of (2.6), given by

$$\mathrm{E}\!\left[\hat{f}(x')\right] - f(x'),$$

where the expectation $\mathrm{E}[\cdot]$ is taken over random samples of training data
(having the same distribution as the training data).
This is the definition of the bias.
We can also look at the mean square error. And since $\hat{f}(x')$ and $\epsilon$ are
independent we easily find that

$$\mathrm{E}\!\left[\left(\hat{f}(x') - y'\right)^2\right] = \left(\mathrm{E}\!\left[\hat{f}(x')\right] - f(x')\right)^2 + \mathrm{E}\!\left[\left(\hat{f}(x') - \mathrm{E}\!\left[\hat{f}(x')\right]\right)^2\right] + \sigma^2,$$

that is, the square of the bias of the method, plus the variance of the method, plus a term that we are
stuck with (the $\sigma^2$).
A good model would have low bias and low variance as illustrated in the
top left of Figure 2.13.
[Figure 2.13: Illustrations of the four combinations: low bias and low variance (top left), high bias and low variance (top right), low bias and high variance (bottom left), high bias and high variance (bottom right).]
Bias is how far away the trained model is from the correct result on
average. Where “on average" means over many goes at training the model,
using different data. And variance is a measure of the magnitude of that
error.
Let’s see how this works with a simple example. We’ll work with the
data in Figure 2.14. This shows the relationship between y and x, including
the random component $\epsilon$. The function $f(x)$ is just a shifted sine wave,
and there’s a uniformly distributed random variable added to it. Each dot
represents a sample, and the full training set is shown.
[Figure 2.14: The true relationship between y and x (a shifted sine wave), also showing the random $\epsilon$.]
Let's start with a simple model for the relationship between x and y, i.e.
a simple $\hat{f}(x)$.
It doesn’t get any simpler than a constant. So in Figure 2.15 I have shown
our first model, it is just a constant, the dashed line. I have chosen for this
simple model the average y value for a subset of all the data. It doesn’t
actually matter which subset of the data because if I changed the subset the
average wouldn’t change all that much. This is clearly very underfitted, to
say the least.
Along comes an unseen sample, represented by the hollow dot in the
figure. But our forecast is the solid dot. There are three sources of error:
the error due to the $\epsilon$, the variance and the bias.
The $\epsilon$ we are stuck with, we can't forecast that, although it is in the
training data and will therefore be implicitly within the model. The variance
(of the model) is tiny. As I said, if I used a different subset of the training
data then the model, here the average, would hardly change at all. So most
of the model error is the bias.
[Figure 2.15: The full training set, the actual behaviour f(x), a simple underfitted model (the constant), an unseen sample and the forecast.]

[Figures 2.16 and 2.17: A more flexible model fitted to Subset 1 and to Subset 2 of the training data, with the unseen sample and the corresponding forecasts.]
Finally, two more plots, fitted to the same subsets but using a simpler
model, just a cubic polynomial. You see much less bias and variance now.
[Figure 2.18: A good fit using a model that is not too simple or too complex (a cubic polynomial fitted to Subset 1), with the actual behaviour f(x), the unseen sample and the forecast.]

[Figure 2.19: The same cubic model fitted to Subset 2.]
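The figures themselves are not reproducible from this extract, but the experiment is easy to mimic. The sketch below is my own, with made-up settings: it fits a constant, a cubic and a deliberately wiggly high-order polynomial to many random subsets of shifted-sine data, and estimates the bias and variance of each model's prediction at one unseen point.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(0, 1, n)
    y = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.uniform(-0.1, 0.1, n)
    return x, y

x_new = 0.3
f_true = 0.5 + 0.4 * np.sin(2 * np.pi * x_new)   # the noise-free value at x_new

for degree in [0, 3, 8]:                          # constant, cubic, very wiggly
    preds = []
    for _ in range(500):                          # 500 different training subsets
        x, y = sample(20)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_new))
    preds = np.array(preds)
    bias = preds.mean() - f_true
    var = preds.var()
    print(f"degree {degree}: bias {bias:+.3f}, variance {var:.3f}")
```

The constant shows a large bias and tiny variance, the high-order polynomial a small bias but a much larger variance, and the cubic sits comfortably in between, which is exactly the story told by the figures.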
The trade off between bias and variance is shown in Figure 2.20.
[Figure 2.20: The trade off between bias and variance: bias, variance and total error plotted against model complexity, with a balanced model where the total error is smallest.]
[Figure 2.21: The contour map of the function $f(x_1,x_2)$ and the line $g(x_1,x_2) = 0$.]
To explain why this works look at the contour map for our hill in Figure
2.21. Here we can see that the greatest value of $f$ along the path is where,
in the view from above, the path and a contour touch. Geometrically
this means that the vector that is orthogonal to the contour is in the same
direction as the vector that is orthogonal to the constraint, or mathematically

$$\left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right) \propto \left(\frac{\partial g}{\partial x_1}, \frac{\partial g}{\partial x_2}\right).$$
But differentiating (2.8) with respect to $x_1$ and $x_2$ gives us this, with $\lambda$ being
the constant of proportionality. And differentiating (2.8) with respect to $\lambda$
is just the constraint again.
Although I've explained the method in two dimensions it is applicable
in any number, and also with any number of constraints, just add to the
Lagrangian terms linear in the constraints, each having its own $\lambda$.
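As a small worked example, not taken from the book: maximize $f(x_1,x_2) = x_1 x_2$ subject to $g(x_1,x_2) = x_1 + x_2 - 2 = 0$. Assuming the Lagrangian takes the usual form $L = f + \lambda g$ (the book's expression (2.8) is not reproduced in this extract), setting the partial derivatives to zero gives

$$\frac{\partial L}{\partial x_1} = x_2 + \lambda = 0, \qquad \frac{\partial L}{\partial x_2} = x_1 + \lambda = 0, \qquad \frac{\partial L}{\partial \lambda} = x_1 + x_2 - 2 = 0,$$

so that $x_1 = x_2 = 1$, $\lambda = -1$ and the constrained maximum is $f = 1$. As the warning below says, these conditions are only necessary; a quick check of nearby points on the constraint confirms this really is a maximum.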
The method can be generalized to inequality constraints. So we would
now want to maximize $f(x_1,x_2)$ subject to $g(x_1,x_2) \leq 0$. We can still work
with (2.8) but now

$$\lambda \geq 0$$

and

$$\lambda\, g(x_1,x_2) = 0.$$
This last expression is called complementary slackness and says that either
$g(x_1,x_2) < 0$ and the constraint is satisfied so we don't need the $\lambda$, or
$g(x_1,x_2) = 0$ and we are on the boundary of the permitted region and we
are back with the earlier version of Lagrange multipliers. This generalization
goes by the name of Karush-Kuhn-Tucker (KKT).
Be careful that Lagrange multipliers and KKT only give necessary conditions
for a solution. So you will need to check that any solution they give
actually works.
Many of the techniques we shall be learning classify data into two groups,
an email is or isn't spam. For example the Support Vector Machine in its
simplest version divides sample data according to which side of a hyperplane
they lie. But what if we have more than one class, we have spam, not spam,
and emails from the mother-in-law?
There are three techniques commonly used: One-versus-one classification;
One-versus-all classification; Softmax function.
But first let me tell you something that is not, and should not be, used.
Suppose you want to determine who a painting is by. You might show
your algorithm many paintings by various artists. You might want to use
numerical labels for each artist. Paintings by van Gogh might be labelled as
1, paintings by Monet as 2, by Cassatt as 3, and so on. And then you show
it a painting by an unknown artist. It gives you its numerical output. It is
2.7. Who is that painting by? Well, it seems to think that it's quite likely
by Cassatt but possibly by Monet. But that would be nonsense. There is
no ordering of artists such that you can say that Monet is halfway between
van Gogh and Cassatt. You would only have a single, scalar, output if the
classes could be ordered in such a way that the associated numbers are
meaningful. E.g. Which rides is a person allowed to go on at Disneyland?
There might be three types of ride, ones for toddlers, ones for children and
ones for adults. That’s three classes but, according to the signs I’ve seen on
rides, they correspond exactly to a person’s height. So that would be fine.
Generally however you don’t want to use a scalar to represent more than
two classes. But you still might want to use numbers for classification. What
can you do? Here are some common methods.
Softmax function

The softmax function takes a set of numbers, $z_1, \ldots, z_K$,
whether positive or negative, and turns them into a new array with values
between zero and one, and such that they sum to one:

$$\frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}.$$
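A minimal sketch of the softmax (my own, not the book's), with the usual subtraction of the largest value for numerical stability:

```python
import numpy as np

def softmax(z):
    # Shift by the maximum so the exponentials can't overflow; the shift
    # cancels in the ratio, so the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, -1.0, 0.5])   # any real numbers, positive or negative
p = softmax(scores)
print(p, p.sum())                      # values in (0, 1) summing to one
```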
People use different bases for the logarithm, but it doesn't make much
difference, it only makes a scaling difference. But if you use base 2 then
the units are the familiar bits. If the event is certain, so that p = 1, the
information associated with it is zero. The lower the probability of an event
the higher the surprise, becoming infinity when the event is impossible.
But why logarithms? The logarithm function occurs naturally in infor
mation theory. Consider for example the tossing of four coins. There are 16
possible states for the coins, HHHH, HHHT, ... , TTTT. But only four bits
of information are needed to describe the state. HTHH could be represented
by 0100.
$$4 = \log_2(16) = -\log_2(1/16).$$
Going back to the biased coin, suppose that the probability of tossing
heads is 3/4 and 1/4 of tossing tails. If I toss heads then that was almost
expected, there's not that much information. Technically it's $-\log_2(0.75) =
0.415$ bits. But if I toss tails then it is $-\log_2(0.25) = 2$ bits.
This leads naturally to looking at the average information, this is our
entropy:

$$-\sum p \log_2(p),$$
where the sum is taken over all possible outcomes. (Note that when there
are only two possible outcomes the formula for entropy must be the same
when p is replaced by 1 — p. And this is true here.)
For an unbiased coin the entropy is easily found to be 1. For a coin with
zero probability of either heads or tails then the entropy is zero. For the
75:25 coin it is 0.811. Entropy is a measure of uncertainty, but uncertainty
linked to information content rather than the uncertainty associated with
betting on the outcome of the coin toss.
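A quick check of those numbers (my sketch, not from the book):

```python
import numpy as np

def entropy(p):
    # Average information, in bits, of a coin with probability p of heads.
    # The p = 0 and p = 1 cases are treated as contributing zero.
    terms = [q * np.log2(q) for q in (p, 1 - p) if q > 0]
    return -sum(terms)

print(entropy(0.5))    # 1.0 bit, the unbiased coin
print(entropy(0.75))   # about 0.811 bits, the 75:25 coin
print(entropy(1.0))    # 0.0, no uncertainty at all
```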
[Figure 2.22: The information function $-\log_2(p)$, the entropy $-p\log_2(p) - (1-p)\log_2(1-p)$, the expected winnings for a $0.5 bet, and the standard deviation of the winnings, all plotted against p. See text for details.]
In Figure 2.22 I plot four quantities for the coin tossing, all against p,
the probability of heads. There is the information function, the $-\log_2(p)$,
the entropy, $-p\log_2(p) - (1-p)\log_2(1-p)$, and for comparison a couple of
lines usually associated with coin tossing. One is the expected winnings for
betting \$0.5 on heads, $p - 0.5$, and the other the standard deviation of the winnings,
$\sqrt{p(1-p)^2 + (1-p)p^2}$. You can see that the standard deviation, the risk,
in the coin toss is qualitatively similar to the entropy.
Entropy is going to be important when we come to decision trees, it will
tell us how to decide in what order to ask questions so as to minimize
entropy, or, as explained later, maximize information gain.
Suppose you have a model for the probability of discrete events, call
this $p_k$, where the index k just means one of K possibilities. The sum of
these probabilities must obviously be one. And suppose that you have some
empirical data for the probabilities of those events, $\hat{p}_k$, with the sum again
being one. The cross entropy is defined as

$$-\sum_k \hat{p}_k \log(p_k).$$
As an example, if the item definitely is an orange then the empirical probabilities are zero
except for orange, for which $\hat{p} = 1$. The cross entropy is thus the $-\log$ of the model probability assigned to orange.
But since the sums of the two probabilities must be one we find that $\lambda = -1$
and

$$p_k = \hat{p}_k.$$
Because the cross entropy is minimized when the model probabilities are
the same as the empirical probabilities we can see that cross entropy is a
candidate for a useful cost function when you have a classification problem.
If you take another look at the sections on MLE and on cost functions,
and compare with the above on entropy you'll find a great deal of overlap
and similarities in the mathematics. The same ideas keep coming back in
different guises and with different justifications and uses.
Some of the methods we will look at are often used for Natural Language
Processing (NLP). NLP is about how an algorithm can understand, interpret,
respond to or classify text or speech.
NLP is used for
• Stop words: Some words such as a, the, and, etc. might not add any
information and can be safely removed from the raw text
• Stemming: One often reduces words to their root form, i.e. cutting
off their endings. Meaning might not be lost and analytical accuracy
might be improved. For example, all of love, loves, loving, loved, lover,
... would become lov
Word2vec
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}.$$
There are two ellipses here, with an overlapping area, and dots inside
them. Imagine that the left ellipse is red, the right blue, and the overlapping
area is thus purple. There are a total of 22 dots, ten in Ellipse A, 15 in
Ellipse B, and thus three in both ellipses, in the purple area.
Assuming a dot is chosen at random, the probability that a dot is in
Ellipse A is

$$P(A) = \frac{10}{22},$$

and the probability that it is in both ellipses is

$$P(A\cap B) = \frac{3}{22}.$$

The probability that a dot is in A, given that it is in B, is $\frac{3}{15}$.
That is, out of the 15 in B there are three that are also in A. This is equal
to

$$\frac{3/22}{15/22} = \frac{P(A\cap B)}{P(B)}.$$
That's a simple example of

$$P(A|B) = \frac{P(A\cap B)}{P(B)}.$$

But symmetrically

$$P(B|A) = \frac{P(B\cap A)}{P(A)}.$$
Putting these together,

$$\frac{P(A|B)}{P(A)} = \frac{P(B|A)}{P(B)}.$$
Having set the scene I am now going to walk you through some of the
most important machine-learning techniques, method by method, chapter by
chapter. I have chosen to present them in order of increasing difficulty. The
first methods can be done in a simple spreadsheet, as long as your datasets
aren't too large. As the methods get more difficult to implement then you
might have to move on to proper programming for anything other than trivial
examples. Although I am not going to cover programming in this book there
are very many books out there which do this job wonderfully. In particular you
will find many books on machine learning using the programming language
du jour, Python.
All of the methods have worked illustrative examples. And they use real
data, not made up. Having said that, none of the results, such as they are,
come with any warranty or guarantee. Despite using real data the examples
are illustrative. Sometimes I've violated good-practice guidelines, for example
not worrying about the curse of dimensionality, in pursuit of a good story. In
some cases the end results seem encouraging, perhaps even useful. Please
also take those with a pinch of salt. Equally some of the results are perhaps
unexpected. I have left those as a lesson in how machine learning can throw
up the unexpected. Whether the unexpected results are meaningful though
would require further analysis.
My primary goal is to get you up and running with some machine learning
as soon as possible.
As the Ramones would say, "Hey, ho, let's go!"
Further Reading

C.E. Shannon, 1948, "A Mathematical Theory of Communication," The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, available online.
Chapter 3
K Nearest Neighbours
[Figure 3.1: Does the square belong with the circles or the triangles?]
Choosing K
A big question is how to choose K. If it is small then you will get low
bias but high variance (see Chapter 2). A large K gives the opposite, high
bias but low variance.
Step 0: Scaling
Computer issues In this book I rarely talk about hardcore computer issues
like memory and calculation speed. But I will make an observation here about
the K nearest neighbours method. Although the non-existent learning part
of the algorithm is lightning fast(!) the classification of new data points can
be quite slow. You have to measure the distance between the new point and
all of the already-classified points.
Skewed data If there are many more data points of one class than others
then the data is skewed. In the above figures that might mean many more
circles than triangles. The circles would tend to dominate by sheer numbers.
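Here is a minimal K nearest neighbours classifier, my own sketch rather than anything from the book. It uses the Euclidean distance on already-scaled features and a straight majority vote; a tree-based index or a library such as scikit-learn would be faster for large datasets, which is the speed issue mentioned above.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, K):
    # Distances from the new point to every already-classified point.
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:K]          # indices of the K closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority class among the neighbours

# Tiny made-up, already-scaled dataset: two features, two classes.
X_train = np.array([[-1.0, -0.8], [-0.9, -1.1], [-1.2, -0.7],
                    [ 1.0,  0.9], [ 1.1,  1.2], [ 0.8,  1.0]])
y_train = np.array(["circle", "circle", "circle",
                    "triangle", "triangle", "triangle"])

print(knn_classify(X_train, y_train, np.array([0.7, 0.6]), K=3))
```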
You can easily find data for men's and women's heights and weights
online. In Figure 3.3 I show sample data, just 500 points for each gender.
Note that the data has been shifted and scaled so as to have a mean of zero
and a standard deviation of one for each of height and weight. Obviously I've
used the same transformation for both genders. Because of the high degree
of correlation I could have done a simple PCA which would have had the
effect of spreading out the data. I haven't done that here for two reasons.
First, because I want to focus on the KNN method. And second, because in
higher-dimensional problems it might not be so obvious that there are any
patterns. I've also sketched a couple of ellipses to help you see roughly where
the different data points are.
[Figure 3.3: Heights and weights for 500 men, 500 women, and one PW, all scaled. Scaled height is on the horizontal axis and scaled weight on the vertical axis.]
We can use KNN on a new sample to decide whether they are male or
female. The new sample is the diamond in the plot (it's me!). You can see
that I'm probably female, at least for small K. Maybe it's time to stop the
diet.
[Figure 3.4: The classification of points for a small value of K; the boundary between the male and female regions is ragged, with several small islands.]
When K = 21 the boundary has been smoothed out and there is just a
single solitary island left. See Figure 3.5.
[Figure 3.5: The classification regions when K = 21.]
[Figure 3.6: The classification regions for a larger value of K.]
Figures 3.4 and 3.6 nicely show the differences between bias and variance.
The former figure shows high variance but low bias. It's a complicated model
because it only looks at the nearest neighbour. Yes, I know that sounds like
it's a simpler model but it’s really not. The model changes a lot as we
move around because the nearest neighbours change a lot. The latter figure
shows high bias and low variance. It's a simpler model because the nearest
neighbours don't change so much as we look at new samples. It has a high
bias and low variance. In the extreme case in which we use the same number
of neighbours as data points then every prediction would be the same.
We can plot misspecification error as a function of the number of neighbours,
K. I have done this in Figure 3.7 using out-of-sample data. It looks
like K = 5 is optimal.
[Figure 3.7: Out-of-sample misspecification error against the number of neighbours, K; the error is smallest around K = 5.]
3.7 Regression
I’ve explained how KNN can be used for classification but it can also
easily be used for regression.
The idea is simple. Instead of samples being labelled with classes, such
as type of fruit, the labels are numbers. The features might be age, IQ, and
height with the labels being salary. Your goal is to find the salary for a new
sample. You tell me your age, IQ and height and I’ll tell you how much you
earn.
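A minimal Python sketch of the idea (my own illustration; the inverse-distance weighting option anticipates the variation discussed next): average the labels of the K nearest neighbours.

import numpy as np

def knn_regress(X_train, y_train, x_new, K=5, inverse_weight=False):
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))   # distances to every training sample
    idx = np.argsort(dists)[:K]                             # the K nearest neighbours
    if not inverse_weight:
        return y_train[idx].mean()                          # plain average of their labels
    w = 1.0 / (dists[idx] + 1e-12)                          # inverse-distance weights
    return np.dot(w, y_train[idx]) / w.sum()

# Toy usage: features are (age, IQ, height), label is salary; scale first, as in Chapter 2
X_raw = np.array([[25, 110, 1.80], [40, 120, 1.75], [35, 100, 1.70], [50, 130, 1.85]])
y_train = np.array([30000.0, 60000.0, 45000.0, 80000.0])
mu, sd = X_raw.mean(axis=0), X_raw.std(axis=0)
X_train = (X_raw - mu) / sd
x_new = (np.array([38, 115, 1.78]) - mu) / sd
print(knn_regress(X_train, y_train, x_new, K=2))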
Inverse-distance weighting might be a bit too extreme: if you have a new data point that is exactly the same as one in your training set then it will be given exactly the same label, with no room for anything like a confidence interpretation.
[Figure: KNN regression, showing the data points used in the regression, the unused data, a straight line fitted through the used points, and the new sample.]
Further Reading
There are not many books devoted to KNN in general, perhaps because
it is so straightforward. Here are a couple you might want to look at.
Algorithms for Data Science by Brian Steele, John Chandler and Swarna Reddy, published by Springer in 2017, covers the method, algorithms and also has exercises and solutions.
If you want to read something with depth then you should look at Lectures
on the Nearest Neighbor Method (Springer Series in the Data Sciences) by
Gerard Biau and Luc Devroye, published in 2015. This book covers the
methods and provides a rigorous basis for them.
There are however many books and research papers that cover the application of KNN to specific problems.
Chapter 4
K Means Clustering
data point is assigned to is governed simply by its distance from the centres
of the clusters.
Since it's the machine that decides on the groups this is an example of
unsupervised learning. KMC is a very popular clustering technique. It is
highly intuitive and visual, and extremely easy to programme.
The technique can be used in a variety of situations:
• Looking for structure within datasets. You have unclassified data but
you suspect that the data fall naturally into several categories.
You might have data for car models, such as prices, fuel efficiency,
wheel size, speaker wattage, etc. You find that there are two natural
clusters, and they turn out to be cars that appeal to people with no
taste and cars that appeal to everyone else. The data might look like
in Figure 4.1. Here there are two obvious distinct groups.
The data in Figure 4.2 might represent family income on the horizontal
axis with the vertical axis being the number of people in the household.
A manufacturer of cutlery wants to know how many knives and forks to
put into his presentation boxes and how fancy they should be. There
might not be obvious groups but that doesn't necessarily matter.
Step 0: Scaling
As explained in Chapter 2 we first scale our data, since we are going
to be measuring distances.
Now we start on the iteration part of the algorithm.
We need to seed the algorithm with centres for the K clusters. Either
pick K of the N vectors to start with, or just generate K random
vectors. In the latter case they should have the same size properties
as the scaled datasets, either in terms of mean and standard deviation
or minimum and maximum. Call these centres $\mathbf{v}^{(k)}$ for $k = 1$ to $K$.
See the two diamonds in Figure 4.3.
Now for each data point (vector) $\mathbf{x}^{(n)}$ measure its distance from the centres of each of the K clusters. We've discussed measuring distances in Chapter 2. The measure we use might be problem dependent. If we have the house/postbox problem then we'd probably use the Manhattan distance (unless you expect people to walk through each other's back gardens). But often we'd use the obvious Euclidean distance:

$$\text{Distance}_k^{(n)} = \sqrt{\sum_{m=1}^{M}\left(x_m^{(n)} - v_m^{(k)}\right)^2} \quad \text{for } k = 1 \text{ to } K.$$
Each data point, that is each n, is then associated with the nearest cluster/centre:

$$\underset{k}{\operatorname{argmin}}\ \text{Distance}_k^{(n)}.$$
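Pulling the steps together, here is a minimal Python sketch of the whole KMC iteration (my own illustrative code, not the book's Excel); the centre-update step, moving each centre to the mean of its assigned points, and a convergence check complete the loop.

import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    # X is the scaled data, one row per sample
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]    # seed with K of the data points
    for _ in range(n_iter):
        # Euclidean distance from every point to every centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                    # nearest centre for each point
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centres[k] for k in range(K)])
        if np.allclose(new_centres, centres):            # converged
            break
        centres = new_centres
    return centres, labels

# Toy usage on obviously clustered two-dimensional data
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centres, labels = k_means(X, K=2)
print(centres)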
Figure 4.7: Two ways that error could decrease with number of clusters. The
triangles are an example with an elbow.
If the error only falls gradually (the circles in Figure 4.7) then there is no obvious best K. Here we don't see a large drop in error followed by a flattening out. That doesn't mean that there isn't a natural grouping of the data, for example you could have data like in Figure 4.8, but it will be harder to find that grouping.
Figure 4.8: There is an obvious pattern here but it will take some work — a
change of coordinate system, for example — to find it.
Note that while convergence is typically quite rapid, the algorithm might
converge to a local minimum distance. So one would usually repeat several
times with other initial centres.
4.5 Example: Crime in England, 13 dimensions

For our first real example I am going to take data for crime in England. The data I use is for a dozen different categories of crime in each of 342 local authorities, together with population and population density data. The population data is only for scaling the numbers of crimes, so we shall therefore be working in 13 dimensions. Don't expect many 13-dimensional pictures.
The raw data looks like in Figure 4.9. The full list of crimes is:
Burglary in a building other than a dwelling
Burglary in a dwelling
Criminal damage
Drug offences
Fraud and forgery
Offences against vehicles
Other offences
Other theft offences
Robbery
Sexual offences
Violence against the person - with injury
Violence against the person - without injury
Local Authority, Burglary in a building other than a dwelling, Burglary in a dwelling, Criminal damage, Drug offences, Fraud and forgery, Offences against vehicles, Population, Population per Square Mile
Adur 280 120 708 158 68 382 58500 3610
Allerdale 323 126 1356 392 79 394 96100 198
Alnwick 94 33 215 25 11 71 31400 75
Amber Valley 498 367 1296 241 195 716 116600 1140
Arun 590 299 1806 471 194 819 140800 1651
Ashfield 784 504 1977 352 157 823 107900 2543
Ashford 414 226 1144 196 162 99900 446
Aylesbury Vale 696 377 1490 502 315 833 157900 453
Basingstoke & Deane 1728 598 426 930 182 1159 147900 605
The crime numbers are first scaled with population in the local authorities. And then there is the second translation and scaling as discussed in Chapter 2.

A straightforward application of K means clustering results in the following scree plot, shown in Figure 4.10.
Figure 4.10: The scree plot for the crime data, error against number of clusters.
Cluster 1 has precisely one point. It is the City of London. In this data it
appears as an outlier because the population figures for each local authority
are people who live there. And not many people live in the City. We thus
can’t tell from this analysis how safe the City really is, since many of the
crimes probably happen to non residents. For example, Burglary from a
Dwelling is similar in both the City and in Cluster 2.
The other two clusters clearly represent dangerous local authorities (Clus
ter 2) and safer ones (Cluster 3). And what stands out to me, as someone
who spends most of his time in the countryside, is that the safe places are
the least densely populated. Phew.
This example also nicely illustrates an important point, specifically the
effect of scaling. Although there is nothing wrong in having a cluster con
taining a small number of data points, here there could possibly be an issue.
Outliers can easily mess up the scaling. Having a large value for a feature
in a small number of samples will usually cause that feature for the remaining
samples to be rescaled to pretty small numbers. Although this will depend on
the type of rescaling you use. And so when it comes to measuring distances
this feature will not fully participate. In the present case what I should now
do, if this were a research paper, is to remove the outlier, the City of London,
and redo the calculations.
The above is quite a high-dimensional problem, 13 features meaning 13
dimensions, compared to relatively few, 342, training points. We might have
expected to suffer from the curse of dimensionality mentioned in Chapter 2.
From the results it looks like, luckily, we didn't. The reason for this might be
the common one that the different features don’t seem to be independent.
Theft, robbery, burglary are very similar. Fraud and forgery, and sexual
offences are not.
If you wanted to reduce the dimensions early on, before using K means,
then you could use Principal Components Analysis. Or, as possibly here,
don't use features that common sense says are very closely related.
We are now going to do a few financial/economic examples. There is an
abundance of such data, for different financial or economic quantities and
in vast amounts. Some of it you might have to pay for, for example the
so-called tick data used in high frequency trading, but my examples all use
the free stuff.
Remember that this KMC analysis completely throws away any time
dependence in the behaviour of volatility. However I do have some motivation
for using K means on this data and that is that there is a model used in
finance in which volatility jumps from one level to another, from regime to
regime. And the volatility in this plot seems to behave a bit like this. You
see that often the volatility is pretty low, sometimes kind of middling, and
occasionally very large.
That's not exactly what we have here. Here there is a continuum of
volatility levels. But I'm not going to worry about that. I'm going to work
with three clusters, K = 3. Remember this is Just an illustrative example.
Let's see how it goes. I will find those three clusters and then take this model
a little bit further.
The algorithm very quickly finds the following clusters, see Table 4.2.
We can take this a step further and use it to give an idea of the likelihood of jumping from one volatility regime to another. And this is where I cunningly bring back a very simple, weak time dependence.

Rather easily one finds the following matrix of probabilities, Table 4.3.
To:
From: Cluster 1 Cluster 2 Cluster 3
Cluster 1 84% 16% 0%
Cluster 2 38% 57% 5%
Cluster 3 0% 54% 46%
We interpret this as, for example, the probability of jumping from Cluster 1 to Cluster 2 is 16% every 30 days.
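Those probabilities come straight from the sequence of cluster labels: count how often cluster i is followed by cluster j one observation (30 days) later, then normalise each row. A small illustrative Python sketch, with made-up labels:

import numpy as np

def transition_matrix(labels, K):
    # labels[t] is the cluster the volatility falls into at observation t
    counts = np.zeros((K, K))
    for a, b in zip(labels[:-1], labels[1:]):    # consecutive (30-day) observations
        counts[a, b] += 1
    # Normalise each row so it gives probabilities of jumping from cluster a to cluster b
    return counts / counts.sum(axis=1, keepdims=True)

labels = np.array([0, 0, 1, 1, 2, 1, 0, 0, 1, 1])   # toy sequence of cluster labels
print(transition_matrix(labels, K=3))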
[Figures: interest rate and inflation through time, and a scatter plot of inflation against interest rate.]
Even though the interest rate and inflation numbers are ballpark the same
order of magnitude I still translated and scaled them first.
With four clusters we find the results in Table 4.4.
Table 4.4: Clusters in interest rates and inflation. (In original scales.)
The centres of the clusters and the original data are shown in Figure
4.15. I have here plotted in the scaled quantities, and with the same length
of axes, to make it easier to see (and draw) the lines dividing the nearest
cluster.
Figure 4.15: Inflation versus interest rate showing the four clusters. (Scaled
quantities.) This is a Voronoi diagram (see text).
To:
From: Cluster 1 Cluster 2 Cluster 3 Cluster 4
Cluster 1 88.9% 11.1% 0.0% 0.0%
Cluster 2 5.6% 86.1% 8.3% 0.0%
Cluster 3 0.0% 25.0% 58.3% 16.7%
Cluster 4 0.0% 0.0% 33.3% 66.7%
4.8 Example: Rates, inflation and GDP

Take the data in the previous example and add to it data for GDP growth.
We now have a three-dimensional problem. (And now I’ve used quarterly
data for this.)
With six clusters I found that three of the clusters that I had earlier
did not change much but the normal economy broke up into three separate
clusters. See Figure 4.17 for the results plotted in two dimensions.
(The in-figure cluster labels include: High Base Rate, High Inflation; High Base Rate, Medium Inflation; Normal Economy; Very Low Base Rate, Normal Inflation.)
Figure 4.17: Inflation versus interest rate. (The GDP-growth axis would
come out of the page. Original scaling.)
The clusters are in Table 4.6.
Table 4.6: Clusters for interest rates, inflation and GDP growth.
Choosing K: If you are lucky then it will be clear from the error-versus-K plot what is the optimal number of clusters. Or perhaps the number will be obvious, something about the nature of the problem will be a clue. Not quite so convincing but it helps plausibility if you can give each cluster a name, like I did in the first interest rate example.
Further Reading
Chapter 5

Naive Bayes Classifier
But we are going to apply NBC not Just to a single word but to whole
phrases and ultimately entire speeches. And also we don’t calculate the exact
probability of a politician being left wing. Instead we compare probabilities
for being left wing and right wing. You will see when I go through the details
in a moment.
5.5 In Symbols

$$P(C_k|\mathbf{x}) \qquad (5.2)$$

for each of the K classes (political persuasions) $C_k$. And whichever probability is largest gives us our decision (that's just maximum likelihood as mentioned in Chapter 2).
$$P(C_k)\prod_{m=1}^{M} P(x_m|C_k).$$

This is what we must compare for each class. The term $P(x_m|C_k)$ we will get from the data, just look at the speeches of other politicians and look at the probabilities of words, the $x_m$s, appearing and which political direction they lean.
Finally, because we are here multiplying potentially many small numbers we usually take the logarithm of this expression. This doesn't change which class gives the maximum likelihood, just makes the numbers more manageable:

$$\ln(P(C_k)) + \sum_{m=1}^{M}\ln(P(x_m|C_k)).$$
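A rough Python sketch of this comparison (my own illustrative code; the add-one smoothing is an assumption, there to avoid taking the logarithm of zero): estimate the word probabilities for each class from training speeches, then sum the logs for a new speech and pick the class with the larger total.

import math
from collections import Counter

def train_nbc(speeches_by_class):
    # speeches_by_class maps a class name to a list of word lists
    models = {}
    for cls, speeches in speeches_by_class.items():
        counts = Counter(w for speech in speeches for w in speech)   # word counts for this class
        models[cls] = (counts, sum(counts.values()), len(counts))
    return models

def classify(models, words, prior=None):
    scores = {}
    for cls, (counts, total, vocab) in models.items():
        score = math.log(prior[cls]) if prior else 0.0
        for w in words:
            # add-one smoothing so unseen words don't give log(0)
            score += math.log((counts[w] + 1) / (total + vocab))
        scores[cls] = score
    return max(scores, key=scores.get)   # maximum (log) likelihood class

# Toy usage with made-up word lists
training = {"left":  [["workers", "unite", "fair"], ["fair", "public"]],
            "right": [["markets", "free", "tax"], ["tax", "enterprise"]]}
models = train_nbc(training)
print(classify(models, ["free", "markets", "fair"]))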
Pope and Czar, Metternich and Guizot, French Radicals and German police-spies...

You will have noticed that I have removed a lot of stop words. So it's not as immediately recognisable as it might be!
For a bit of fun I took the speech and imagined doing a real-time analysis
of it, in the sort of way you might carefully listen to the report by a CEO
and try to read into his presentation whether the news is good or bad before
quickly putting in your buy or sell order. And the results are shown in Figure
5.1. By 15 important words the speech has already set its tone. (Forty words
of actual speech.)
Figure 5.1: The probability that the speaker is right wing, as the words arrive.
This is supposed to be the probability that the speaker is right wing. Can
you guess who it is?
Well, the person is not exactly famous for being right wing. And he was
not exactly a politician. But he is famous for giving good speech. It is,
of course, the "I Have A Dream” speech by Martin Luther King Jr. What
does this tell us? Maybe MLK was right wing. Maybe this analysis was
nonsense, too little data, too little follow up. Or maybe the classifier has
found something I wasn't looking for, it’s lumped in MLK with Churchill and
Thatcher, and not with Jeremy Corbyn, because of his powers of rhetoric.
There are many ways you can take this sort of analysis, but with much
of ML (that's Machine Learning, not Martin Luther) you have to treat it
with care and don't be too hasty. If this were anything but an illustrative
example I would have used a lot more data, more politicians, more speeches
and done a full testing and validation. However I shall leave it here as I am
simply demonstrating the technique.
Further Reading
For a short and inexpensive overview of the subject of NLP see Intro
duction to Natural Language Processing: Concepts and Fundamentals for
Beginners by Michael Walker, published in 2017.
Chapter 6
Regression Methods
You will no doubt know about regression from fitting straight lines through
a set of data points. For example, you have values and square footage for
many individual houses, is there a simple linear relationship between these
two quantities? Any relationship is unlikely to be perfectly linear so what
is the best straight line to put through the points? Then you can move on
Although numerical methods are not needed when you have linear regression in a single dimension you will need to use some numerical method for anything more complicated. Batch gradient descent and stochastic gradient descent have both been mentioned in Chapter 2. To use these methods you will need

$$\frac{\partial J}{\partial \boldsymbol{\theta}} = \frac{1}{N}\sum_{n=1}^{N}\left(\boldsymbol{\theta}^T\mathbf{x}^{(n)} - y^{(n)}\right)\mathbf{x}^{(n)}.$$
Step 1: Iterate
Step 2: Update
Update all $\theta_k$ simultaneously. Repeat until convergence.
[Figure: data taking the values 0 and 1, together with a straight-line fit and a sigmoidal fit.]
One way to see why this is nonsense is to add another dot to this figure, a dot way off to the right, with y = 1. It is clearly not conflicting with the data that's already there, if it had y = 0 then that might be an issue. But even though it seems to be correctly classified it would cause a linear fit to rotate, to level out, and this would affect predictions everywhere.
Also sometimes the vertical axis represents a probability, the probability
of being in one class or another. In that case any numbers that you get that
are below zero or above 1, which you will get with a linear fit, will also be
nonsense.
This function is shown in the figure. We would fit this function to the data and then given a new data point (email) we would determine whether it was spam or not according to a threshold on the fitted value. We might have a threshold of 0.5, so that anything above that level would go into our spam box. Or we might use a different threshold. If we are worried that genuine emails are going into our spam box (we have friends who are very bad at speeling and use lots of exclamation marks) then the threshold might be 0.8, say.
$$\frac{\partial J}{\partial \boldsymbol{\theta}} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{1}{1 + e^{-\boldsymbol{\theta}^T\mathbf{x}^{(n)}}} - y^{(n)}\right)\mathbf{x}^{(n)}.$$
This means that our gradient descent algorithm remains, surprisingly, un-
changed.
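A minimal Python sketch of the fit (my own illustration, not the book's implementation; the learning rate and iteration count are arbitrary): gradient descent on the logistic cost, which looks just like the linear-regression update with the sigmoid applied.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, beta=0.1, n_iter=5000):
    # X has a leading column of ones so theta[0] is the intercept
    theta = np.zeros(X.shape[1])
    N = len(y)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / N    # dJ/dtheta
        theta -= beta * grad                         # simultaneous update of all theta_k
    return theta

# Toy usage: one feature, classes 0 and 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])
theta = fit_logistic(X, y)
print(sigmoid(X @ theta))    # fitted probabilities, compare with a 0.5 threshold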
6.5 Example: Political speeches again

In the last chapter I used some data from the speeches and writings of
politicians to determine the nature of an unseen speech. I'm not sure that
it quite went according to plan, but was nevertheless a fascinating exercise,
I thought. I'm going to use the same data, the same eight speeches, here
but in a different way, using a regression method.
Again I shall take speeches/writings by N = 8 politicians. And I shall label each politician as either 0 for left wing or 1 for right wing. These are the $y^{(n)}$ for $n = 1,\ldots,N$. But instead of looking at the frequency of individual words as before I shall look at the types of words used. Are the words positive words, negative words, irregular verbs, etc.? For a total of M features. So the $n^{th}$ politician is represented by $\mathbf{x}^{(n)}$, a vector of length M + 1. The first entry is 1. The second entry would be the fraction of positive words the $n^{th}$ politician uses, the third entry the fraction of negative words, and so on.
But how do I know whether a word is positive, negative, etc.? For this I need a special type of dictionary used for such textual analysis.

One such dictionary, easily found online, is the Loughran-McDonald dictionary. This is a list of words classified according to various categories. These categories are: Negative; Positive; Uncertainty; Litigious; Constraining; Superfluous; Interesting; Modal; Irregular Verb; Harvard IV; Syllables. This list is often used for financial reporting (hence the litigious category). Most of the category meanings are clear. Modal concerns degree, from words such as "always" (strong modal, 1) through "can" (moderate, 2) to "might" (weak, 3). Harvard IV is a Psychosociological Dictionary.
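As a small illustrative Python sketch (the dictionary below is a made-up stand-in for the real Loughran-McDonald word lists): count what fraction of a speech's words fall into each category and assemble the feature vector, with its leading 1.

def category_fractions(words, dictionary):
    # dictionary maps a category name to a set of words in that category
    n = len(words)
    feats = [1.0]                                    # leading 1 for the intercept
    for category, wordset in dictionary.items():
        feats.append(sum(w in wordset for w in words) / n)
    return feats

# Toy stand-in for the real dictionary
dictionary = {"Positive": {"great", "fair", "strong"},
              "Negative": {"crisis", "fail", "weak"}}
speech = "a great and fair plan to end the crisis".split()
print(category_fractions(speech, dictionary))        # [1.0, fraction positive, fraction negative]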
The results of the analysis of the speeches are shown in Table 6.1. The fractions of positive, negative, etc. words are very small because the vast majority of words in the dictionary I used are not positive, negative, etc.

Because I was only using eight politicians for the training I did not use all 11 of these categories for the regression fitting. If I did so then I would probably get a meaningless perfect fit. (Twelve unknowns and eight equations.) So I limited the features to just positive, negative, irregular verbs and syllables.

I found that the θs for positive words, irregular verbs and number of syllables were all positive, with the θ for negative words being negative. I leave the reader to make the obvious interpretation.
However, you should know the disclaimer by now!
Finally I thought I would classify my own writing. So I took a sample
from that other famous political text Paul Wilmott On Quantitative Finance,
second edition. And I found that I was as right wing as Margaret Thatcher.
Take from that whatever you want.
But seriously, if you were to do this for real you would use the above methodology but improve on its implementation in many ways. A couple of obvious improvements would be:
6.6 Other Regression Methods

You can go a lot further with regression methods than the techniques I've covered here. Just briefly...
There are many (basis) functions in common use for fitting or approximating functions. You will no doubt have used some yourself. For example, Fourier series, Legendre polynomials, Hermite polynomials, radial basis functions, wavelets, etc. All might have uses in regression. Some are particularly user friendly, being orthogonal (the integral of their products over some domain, perhaps with a weighting function, is zero).
Polynomial regression

Instead of fitting just $\theta_0 + \theta_1 x$ you can fit a polynomial, $\theta_0 + \theta_1 x + \theta_2 x^2$, which, if you set $x_1 = x$ and $x_2 = x^2$, you can think of as the linear

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2.$$

Because of the obvious correlation between $x$ and $x^2$ it can be tricky to interpret exactly what the coefficients in the polynomial mean.
Ridge regression

And similarly for other cost functions. Here I am using $\boldsymbol{\theta}$ to mean the vector with the first entry being zero (i.e. there is no $\theta_0$). And for comparison with below, this is the $L^2$ or Euclidean norm here. This extra penalty term has the effect of reducing the size of the (other) coefficients.

Why on earth would we want to do this? Generally it is used when there is quite a strong relationship between several of the factors. Fitting to both height and age might be a problem because height and age are strongly related. An optimization without the regularization term might struggle because there won't, in the extreme case of perfect correlation, be a unique solution. Regularization avoids this and would balance out the coefficients of the related features.
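For linear regression with a quadratic cost the ridge estimate even has a closed form, which makes for a compact sketch. This is my own illustrative Python, assuming the usual conventions that the penalty parameter is called λ (lam below) and that the intercept is not penalized.

import numpy as np

def ridge_fit(X, y, lam=1.0):
    # X includes a leading column of ones; we do not penalize the intercept theta_0
    M = X.shape[1]
    penalty = lam * np.eye(M)
    penalty[0, 0] = 0.0
    # Closed-form solution of min |X theta - y|^2 + lam |theta (excluding theta_0)|^2
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Toy usage with two strongly correlated features
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)           # almost perfectly correlated with x1
X = np.column_stack([np.ones(50), x1, x2])
y = 2.0 + x1 + x2 + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.0))                # no penalty: coefficients can be unstable
print(ridge_fit(X, y, lam=1.0))                # ridge: coefficients balanced between x1 and x2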
Lasso regression

LASSO stands for Least Absolute Shrinkage and Selection Operator. This is similar to ridge regularization but the penalty term is now the $L^1$ norm, the sum of the absolute values rather than the sum of squares.

Not only does Lasso regression shrink the coefficients, it also has a tendency to set some coefficients to zero, thus simplifying models. To check this last point, plot $|\boldsymbol{\theta}| = $ constant in $\boldsymbol{\theta}$ space (stick to two dimensions!) and draw comparisons between minimizing the loss function with penalty term and minimizing the loss function with a constraint (see Chapter 2 and Lagrange multipliers).
Further Reading
Chapter 7

Support Vector Machines
In Figure 7.1 are shown sample vectors divided into two classes (the
circles and the diamonds). There are very clearly two groups here. These
classes can be divided using a straight line, they are linearly separable. This
dividing straight line can then tell us to which class a new, unclassified,
sample belongs, depending on whether it is on the circle or the diamond side
of the line. With such a clear distinction between these groups it should be
easy to find such a line. It is, but it is so easy that there are many lines that
will do the job, as illustrated in the figure. So which is the best one?
One definition is that the best line is the one that has the largest margins
between it and the samples. This is shown as the bold line in the figure.
Figure 7.1: Two classes and three possible borders between them.
These margins are shown in Figure 7.2, where I've also shown the vector
that is orthogonal to the hyperplane dividing the two classes. I say hyper
plane because we will generally be in many dimensions, here of course this
hyperplane is Just the straight line.
Our goal is to find the hyperplane such that the margins are furthest
apart. The cases that define the margins are called the support vectors.
You'll notice that this section is called Hard Margins. This is a reference
to there being a clear boundary between the two classes. If one sample strays
into the wrong region, a diamond over to the circle side, say, then we will
need to do something slightly different from what immediately follows. And
we’ll look at that in the Soft Margins and Kernel Trick sections.
Figure 7.2: Margins and the vector orthogonal to the dividing hyperplane.
$$\boldsymbol{\theta}^T\mathbf{x} + \theta_0 = 0,$$

with the two margins at

$$\boldsymbol{\theta}^T\mathbf{x} + \theta_0 = \pm 1. \qquad (7.1)$$

And that's where the scaling happened. The $+$ refers to the margin close to the diamonds, and the $-$ to the margin close to the circles. From (7.1) we have

$$\boldsymbol{\theta}^T\left(\mathbf{x}^{+} - \mathbf{x}^{-}\right) = 2. \qquad (7.2)$$
We also know from Equation (7.2) that (you might have to refresh your memory about the meaning of the inner product of two vectors in terms of distances and angles)

$$\text{margin width} = \frac{2}{|\boldsymbol{\theta}|}.$$

And so maximizing the margin is equivalent to finding $\boldsymbol{\theta}$ and $\theta_0$ to minimize $|\boldsymbol{\theta}|$ or $\frac{1}{2}|\boldsymbol{\theta}|^2$. For a new point $\mathbf{u}$ we then evaluate

$$\boldsymbol{\theta}^T\mathbf{u} + \theta_0 \qquad (7.6)$$
and depending on whether this is positive or negative we have a diamond or
a circle. This is known as the primal version of the classifier.
We shall take a break from the mathematics here and implement a simple
example in Excel.
SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5,3.6,1.4,0.2,Iris-setosa
In Figure 7.3 are shown the 50 samples for each type of iris, the data
being just petal length and sepal width. (I can't plot in four dimensions!)
Figure 7.3: Sepal width against petal length for the three types of iris.
The two margins are given by the same equation but with a plus or minus
one on the right.
Now given a new iris with values for sepal width and petal length we
can determine whether it is setosa or versicolor by plugging the numbers
into Expression (7.6), or equivalently the left-hand side of the above, and
seeing whether the number that comes out is positive or negative. A positive
number for the left-hand side means you have a setosa. If the number has
absolute value less than one it means we have an iris in the middle, the
no-man's land. We will still have a classification but we won't be confident
in it.
As a little experiment see what happens if you move one of the data
points, say a setosa, clearly into the versicolor group. When you try to solve
for the dividing hyperplane you will find that there is no feasible solution.
$$L = \frac{1}{2}|\boldsymbol{\theta}|^2 - \sum_{n=1}^{N} a_n\left(y^{(n)}\left(\boldsymbol{\theta}^T\mathbf{x}^{(n)} + \theta_0\right) - 1\right) \qquad (7.7)$$
result is pretty much what you would expect. Setting these derivatives to zero you end up with

$$\sum_{n=1}^{N} a_n y^{(n)} = 0 \qquad (7.8)$$

and

$$\boldsymbol{\theta} = \sum_{n=1}^{N} a_n y^{(n)}\mathbf{x}^{(n)}, \qquad (7.9)$$

which means that our vector orthogonal to the hyperplane is just a linear combination of sample vectors.
Substituting Equation (7.9) into (7.7) and using (7.8) results in

$$L = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j y^{(i)} y^{(j)}\,\mathbf{x}^{(i)T}\mathbf{x}^{(j)}. \qquad (7.10)$$
We want to maximize L over the as, all greater than or equal to zero,
subject to (7.8). This is known as the dual problem.
Once we have found the as the dual version of the classifier for a new point $\mathbf{u}$ is then just

$$\sum_{n=1}^{N} a_n y^{(n)}\mathbf{x}^{(n)T}\mathbf{u} + \theta_0,$$
When we solve the dual problem for the two-dimensional, two-class, iris
classification we find that a for the versicolor support vector shown as the
hollow diamond in Figure 7.5 is 1.025, for the upper hollow circle is 0.810,
and for the lower hollow circle 0.215. And indeed all other as are zero.
The end result is quite simple in that the decision boundary is defined
using just a small subset of the sample vectors. This makes classification of
new data very quick and easy.
If there is clear water between your groups then SVM works very well.
What if our data is not linearly separable? I'm going to explain two
possible ways to address this situation. (In practice you might use both
methods simultaneously.) In the first method we try to do the best we can
while accepting that some data points are in places you'd think they shouldn't
be. It's all about minimizing a loss function.
In the second method we transform our original problem into higher
dimensions, thereby possibly giving us enough room to find genuinely linearly
separable data. So, to the first method.
7.6 Soft Margins
• This looks quite a lot like Equation (7.7), but without the different as,
with a maximum function, and a change of sign
• In the language of optimization the first term is the loss function and
the second term is the regularization
• The maximum function in the first term has the effect of totally ig
noring how far into the correct region a point is and only penalizes
distance into the wrong region, measured from the correct margin.
The plot of the maximum function against its argument is supposed
to look like a hinge, giving this loss function the name hinge loss
Gradient descent

$$\text{New } \theta_0 = \text{Old } \theta_0 - \beta \times \begin{cases} -y^{(n)} & \text{if } 1 - y^{(n)}\left(\boldsymbol{\theta}^T\mathbf{x}^{(n)} + \theta_0\right) > 0 \\ 0 & \text{otherwise}\end{cases}$$

and

$$\text{New } \boldsymbol{\theta} = \text{Old } \boldsymbol{\theta} - \beta \times \begin{cases} 2\lambda\boldsymbol{\theta} - y^{(n)}\mathbf{x}^{(n)} & \text{if } 1 - y^{(n)}\left(\boldsymbol{\theta}^T\mathbf{x}^{(n)} + \theta_0\right) > 0 \\ 2\lambda\boldsymbol{\theta} & \text{otherwise}\end{cases}$$

Technically you might want to call this sub-gradient descent because the derivatives aren't defined at the hinge.
In Figure 7.6 I've taken the earlier iris data and moved one setosa deep
into versicolor territory. The resulting margins have become wider, and are
softer since a few data points are within those margins.
You can also use the soft-margin approach even if you have linearly sep
arable data. It can be used to give wider margins but some points will now
lie within the margins.
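A minimal Python sketch of this sub-gradient descent (my own illustrative code; the values of β, λ and the number of passes are arbitrary choices): loop over the samples, apply the update above whenever a point is inside or beyond its margin, and otherwise just shrink θ.

import numpy as np

def svm_soft_margin(X, y, beta=0.01, lam=0.01, n_epochs=200):
    # y must be labelled +1 / -1
    theta = np.zeros(X.shape[1])
    theta0 = 0.0
    for _ in range(n_epochs):
        for xn, yn in zip(X, y):
            if 1 - yn * (theta @ xn + theta0) > 0:       # inside or beyond the margin
                theta -= beta * (2 * lam * theta - yn * xn)
                theta0 -= beta * (-yn)
            else:                                        # correctly classified, outside the margin
                theta -= beta * (2 * lam * theta)
    return theta, theta0

# Toy usage: two nearly separable groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [5.0, 5.0], [6.0, 5.5], [5.5, 6.5]])
y = np.array([-1, -1, -1, 1, 1, 1])
theta, theta0 = svm_soft_margin(X, y)
print(np.sign(X @ theta + theta0))    # should recover the labels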
We might have data that is not linearly separable but which nonetheless falls neatly into two groups, just not groups that can be divided by a hyperplane. A simple example would be that shown in Figure 7.7.
For the data in Figure 7.7 we could go to three dimensions via the transformation

$$(x_1, x_2) \rightarrow \left(x_1, x_2, x_1^2 + x_2^2\right).$$

This would have the effect of lifting up the outer group of dots higher than the inner group in the new third dimension, making them linearly separable.
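A tiny Python sketch of that lift, on made-up data consisting of two concentric rings: in the original two dimensions no straight line separates the groups, but after adding $x_1^2 + x_2^2$ as a third coordinate a simple threshold on that coordinate does.

import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
inner = np.column_stack([np.cos(angles[:50]), np.sin(angles[:50])])        # radius ~ 1
outer = 3 * np.column_stack([np.cos(angles[50:]), np.sin(angles[50:])])    # radius ~ 3
X = np.vstack([inner, outer])

# The lift: add x1^2 + x2^2 as a third feature
lifted = np.column_stack([X, (X ** 2).sum(axis=1)])

# In the third coordinate the two rings sit at heights ~1 and ~9,
# so any threshold in between separates them perfectly
print(lifted[:50, 2].max(), lifted[50:, 2].min())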
But once we start moving into higher dimensions we find that the training
procedure gets more time consuming. Is there any way to take advantage of
higher dimensions without suffering too much from additional training time?
Yes, there is. And it follows on from two observations.
First observation: Thanks to the dual formulation, in (7.10), we can see that finding the as depends only on the products $\mathbf{x}^{(i)T}\mathbf{x}^{(j)}$.

Second observation: If you plug (7.9) into (7.6) you can see that the classification of new data only depends on the products $\mathbf{x}^{(n)T}\mathbf{u}$. These become

$$\phi\!\left(\mathbf{x}^{(i)}\right)^T\phi\!\left(\mathbf{x}^{(j)}\right) \quad \text{and} \quad \phi\!\left(\mathbf{x}^{(n)}\right)^T\phi(\mathbf{u}),$$

for our transformed data.
So rather than needing to know the exact data points we actually only
need a lot less information: The products of the vectors. And this leads on
to an idea that exploits these observations to address problems that are not
immediately linearly separable. And this idea is powerful because we can go
to many, perhaps even infinite, dimensions without adding much numerical
complexity or programming time.
The function $\phi(\mathbf{x})$ is known as a feature map. And what is clever about the feature-map transformation is that you don't need to explicitly know what the map is in order to exploit it!
The trick in Kernel Trick is to transform the data from our M-dimensional space into a higher-dimensional space where the data is linearly separable. Let me show how this idea works with an example. Specifically let's work with the example of $\phi(\mathbf{x})$ transforming a two-dimensional vector $(x_1, x_2)^T$ into a six-dimensional vector
$$\mathcal{K}(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{|\mathbf{x} - \mathbf{x}'|}{2\sigma}\right), \quad \text{exponential kernel.}$$
The observant will look at the above examples of kernel functions and say "I can see the product in the polynomial kernel, but I don't see it in the Gaussian kernel." Well, it is in there but we need to do a little bit of work to see it.

First of all, what do we mean by the product of two vectors in the present context? Let's write $\phi(\mathbf{x})$ in long hand:

$$\phi(\mathbf{x}) = \left(\phi_1(x_1,\ldots,x_n),\ \phi_2(x_1,\ldots,x_n),\ \ldots,\ \phi_m(x_1,\ldots,x_n)\right).$$
Notice how we've gone from $n$ to $m > n$ dimensions. So the product we are interested in is

$$\sum_{i=1}^{m} \phi_i(x_1,\ldots,x_n)\,\phi_i(x_1',\ldots,x_n').$$

The key element in this is that the sum is made up of products of the same function of the two initial coordinates $(x_1,\ldots,x_n)$ and $(x_1',\ldots,x_n')$. So our goal is to write the Gaussian kernel in this form. Here goes
$$\exp\left(-\frac{|\mathbf{x} - \mathbf{x}'|^2}{2\sigma^2}\right) = \exp\left(-\frac{\mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{x}' + \mathbf{x}'^T\mathbf{x}'}{2\sigma^2}\right)$$

$$= \exp\left(-\frac{\mathbf{x}^T\mathbf{x}}{2\sigma^2}\right)\exp\left(\frac{\mathbf{x}^T\mathbf{x}'}{\sigma^2}\right)\exp\left(-\frac{\mathbf{x}'^T\mathbf{x}'}{2\sigma^2}\right).$$
The first and last terms in this are of the form that we want. But not yet the middle. That is easily dealt with by expanding in Taylor series. The middle term is then

$$\sum_{i=0}^{\infty}\frac{1}{i!}\left(\frac{\mathbf{x}^T\mathbf{x}'}{\sigma^2}\right)^i.$$
And the job is done, because each of these terms is a polynomial kernel. So
indeed the Gaussian kernel does have the required product form, albeit as
an infinite sum and therefore in infinite dimensions.
Note that some kernels take you from a finite-dimensional space to an
other finite-dimensional space with higher dimension. The polynomial kernel
is an example of this. However others take you to an infinite-dimensional
space.
If we go back to the iris problem in which I moved one setosa into the
versicolor region then using a kernel might be able to make this a separable
problem. However in so doing the boundary of the two regions would in
evitably be very complex. We might find that classifying unclassified irises
would not be very robust. It would almost certainly be better to treat the
odd iris as an outlier, a one off. So if you have data that is only linearly
inseparable because of a small number of outliers then use soft margins. If
however the original problem is not linearly separable but there is structure
then use a kernel.
Further Reading
Chapter 8

Self-Organizing Maps
[Diagram: data items assigned to Groups A to E, with corresponding cell vectors $\mathbf{v}^{(1)}$ to $\mathbf{v}^{(5)}$.]
Remember that all of the $\mathbf{x}^{(n)}$s and the $\mathbf{v}$s are vectors with M entries. All we do is to measure the distance between item n and each of the K cell vectors $\mathbf{v}^{(k)}$:

$$\sqrt{\sum_{m=1}^{M}\left(x_m^{(n)} - v_m^{(k)}\right)^2}.$$

The cell k with the smallest distance, the argmin over k, is called the Best Matching Unit (BMU) for that data point n.
Measure the distance between the BMU cell and the other nodes, according to the obvious grid distance. Each cell vector is then pulled towards the data point,

$$\mathbf{v}^{(k)} \rightarrow \mathbf{v}^{(k)} + \beta\left(\mathbf{x}^{(n)} - \mathbf{v}^{(k)}\right) \quad \text{for all } 1 \le k \le K,$$

with the size of the pull reduced for cells further from the BMU.
Step 2: Iterate
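Pulling Steps 0 to 2 together, here is a minimal Python sketch of the training loop (my own illustrative code; the Gaussian neighbourhood function, the decay schedules for the learning rate and radius, and the grid size are all arbitrary choices, not the book's):

import numpy as np

def train_som(X, grid_w=5, grid_h=5, n_iter=2000, beta0=0.5, radius0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    K = grid_w * grid_h
    # Grid coordinates of each cell, used to decide how strongly a cell feels the BMU's pull
    coords = np.array([(i, j) for i in range(grid_w) for j in range(grid_h)], dtype=float)
    V = rng.normal(size=(K, X.shape[1]))             # seed the K cell vectors
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                  # pick a data point
        bmu = np.argmin(((V - x) ** 2).sum(axis=1))  # Step 1: Best Matching Unit
        # Step 2: pull every cell towards x, less so the further it is from the BMU on the grid
        beta = beta0 * (1 - t / n_iter)
        radius = radius0 * (1 - t / n_iter) + 0.5
        grid_dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        pull = beta * np.exp(-grid_dist2 / (2 * radius ** 2))
        V += pull[:, None] * (x - V)
    return V

# Toy usage: 100 samples with 5 features (e.g. five annual returns per stock)
X = np.random.randn(100, 5)
V = train_som(X)
print(V.shape)    # (25, 5): one vector per grid cell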
But that's what everyone does. The second of these is the popular (at
least in the text books) Modern Portfolio Theory (MPT).
MPT is a clever, Nobel-Prize winning, way of deciding which stocks to
buy or sell based on three types of statistical information: Each stock’s ex
pected return; Each stock’s volatility; The correlations between all stocks'
returns. In this method one aims to maximize the expected return of a port
folio of stocks while holding its volatility, or risk, constant. To achieve this
it exploits correlations between stocks. If two stocks both have a good ex
pected growth but those returns are uncorrelated then spreading your money
across a portfolio of the two of them will be better than holding just one.
And so on for portfolios of any number of stocks. The problem with this
elegant theory is that in practice the parameters, especially the correlations,
are very unstable.
Perhaps SOM might be a more stable method? It is also nonlinear gen
erally, so might capture something that MPT doesn’t.
However, having asked that question I’m not going to fully answer it. I
shall use SOM on stock returns data but won't go too far in the direction
of portfolio selection. That is more for a finance research paper than for
illustrating machine learning.
What I am going to do is a slightly unusual example of self-organized
maps. Usually one would have significantly different features for each stock
(Just like we had different crimes for each local authority when we did K
Means). Perhaps we would have the sector, market capitalization, earnings,
gender of the CEO, etc. mentioned above. But I want to tie this in more
closely to the MPT of classical quantitative finance. For that reason I am
going to use annual stock price returns as a feature. So for each stock I will
have a vector of five features, the last five annual returns. I have chosen
annual data because it fits in with the sort of timescale over which portfolio
allocation methods are used, one doesn’t usually reposition one’s portfolio
every day.
I will use annual returns for each of 476 constituents of the S&P index.
(“Why not 500?" you ask? "Because some stocks left the index and
others Joined during this period," I reply. Since some stocks have left the
A note on scaling Because I've chosen features which are very similar,
just returns over different years, scaling is not as crucial as it would be with
features that are significantly different. So in what follows I have left all data
unscaled.
Let's see what the algorithm gives us.
In Figure 8.6 we see how many of the 476 stocks appear in each node,
with the corresponding contour plot in Figure 8.6. Clearly some nodes are
more popular than others. In those nodes there are a lot of stocks that move
closely in sync.
The most popular node, the one with the most stocks, is number 20. Its
constituents are shown in Table 8.1.
Inspired by MPT one could easily use the growth and volatility of each
stock together with the map structure to optimize a portfolio in a way that
would be different from classical methods. For example diversification could
be achieved by not buying stocks that are in nodes that are close to each
other. In Figure 8.7 is shown the average in each node of the ratio of the
stocks’ growths to the volatilities. Perhaps these numbers will be useful. I
shall leave these possibilities as an exercise for the reader. (But please don’t
I want to finish this example by looking at what this solution has to say
about financial sectors. In Figure 8.8 I show the number of stocks from each
sector in each cell.
Node 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Consumer Discretionary 1 4 2 0 4 3 3 2 4 3 3 0 0 1 5 0 2 1 7 6 2 1 14 3 3
Consumer Staples 0 2 7 1 0 1 2 0 1 1 0 2 1 1 2 0 0 0 0 1 0 1 4 4 1
Energy 1 0 0 0 0 0 0 0 3 3 0 0 0 0 2 0 0 0 0 18 1 0 0 1 1
Financials 0 1 0 0 0 1 4 4 1 1 0 1 0 2 1 0 7 2 0 5 25 3 3 1 1
Healthcare 0 0 1 0 4 8 4 0 2 0 2 0 0 1 3 4 1 3 1 1 1 0 8 11 3
Industrials 0 2 0 1 0 3 6 4 6 2 1 0 0 0 1 1 6 2 0 11 10 2 4 1 1
Information Technology 3 0 0 0 3 6 11 2 2 1 1 1 0 0 3 2 5 0 1 8 9 3 0 1 1
Materials 1 0 0 1 1 0 1 1 0 2 0 1 0 0 0 0 0 2 1 6 3 1 1 1 0
Real Estate 6 8 6 1 0 0 0 0 0 0 4 0 0 1 0 1 0 0 0 1 0 0 0 1 2
Telecommunication Services 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Utilities 2 6 7 1 0 0 0 0 1 0 0 2 3 2 0 0 0 0 0 2 1 0 0 1 0
Figure 8.9 shows this as a 3D plot. Node numbers from 1 to 25 run left
to right and sectors go into the page.
It's not easy to appreciate this in black and white but you can get a rough
idea of what is going on, especially if you also look at the contour map in
Figure 8.10.
There are clearly some sectors that are very focused: Financials are con
centrated around Node 21; Real Estate around Nodes 1-3. Energy around
Node 20.
There are some sectors that are very spread out, such as Consumer Dis
cretionary.
Figure 8.10: The number of stocks from each sector in each cell. The
contour map.
Note that I never told the algorithm anything about sectors. But it does
look as if SOM has found something rather sector-like, but only for some
sectors.
The second figure shows the breakdown of the nodes by political party.
Conservative and Labour are predominantly in completely different areas of
the grid. However Labour are a little bit more spread out. Interestingly,
there are members of the main parties scattered about, not being where
you'd expect them to be. It would be interesting to look at those MPs who
don't seem to be affiliated with the correct party and why. Note that nodes
11, 21, 31, etc. are all close together in the grid.
Obviously I have used this data for illustrative purposes. If this were
being done for real I would take a lot more care, such as testing the results,
and also probably break down the data according to things like the subject
of the vote.
Further Reading
Chapter 9

Decision Trees
You will no doubt have seen decision trees before, although perhaps not in
the context of machine learning. You will definitely have played 20 Questions
before. "Is the actor male?" Yes. "Is he under 50?" No. "Is he bald?" Yes. "Bruce Willis?" Yes. If the answer to the first question had been "No" then you would have taken a different route through the tree.
In 20 Questions the trick is to figure out what are the best Yes/No
questions to ask at each stage so as to get to the right answer as quickly as
possible, or at least in 20 or fewer questions. In machine learning the goal
is similar. You will have a training set of data that has been classified, and
this is used to construct the best possible tree structure of questions and
answers so that when you get a new item to classify it can be done quickly
and accurately. So decision trees are another example of supervised learning.
The questions in a decision tree do not have to be binary, and the answers
can be numerical.
I would not recommend using Excel for decision trees when you have a real problem, with lots of data. It's just too messy. That's because you don't
know the tree structure before you start, making it difficult to lay things out
in a spreadsheet.
First some jargon and conventions. The tree is drawn upside down, the root (the first question) at the top. Each question is a condition or an internal node that splits features or attributes. From each node there will be branches
or edges, representing the possible answers. When you get to the end of the
path through the tree so that there are no more questions/decisions then
you are at a leaf. There are also the obvious concepts of parent and child
branches and nodes. See Figure 9.1.
You can use decision trees to classify data, such as whether or not a
mushroom is edible given various physical features. Those features could
have binary categories such as with/without gills, or multiple categories.
9.3 Example: Magazine subscription
I am going to work with member data from my own website now. I shall
be taking data for a small subset of members to figure out which people are
likely to subscribe to our magazine. (And along the way I'll be shamelessly
promoting both the website and the magazine.) I will not be violating any
data-protection laws.
There are three features of wilmott.com members we will look at: Employment Status; Highest Degree Level; CQF Alumnus (whether or not they
have the Certificate in Quantitative Finance qualification). And the classifi
cation is whether or not they are magazine subscribers. The goal is to use
information about those features for new members to determine whether or
not they will be subscribers of the magazine.

ID  Employment Status  Degree Level  CQF Alumnus  Wilmott Magazine Subscriber
1 Self Employed Postgraduate No No
2 Self Employed Postgraduate Yes Yes
3 Employed Postgraduate Yes Yes
4 Student/Postdoc. Postgraduate No Yes
5 Student/Postdoc. Undergraduate Yes Yes
6 Student/Postdoc. Undergraduate Yes No
7 Employed Undergraduate Yes Yes
8 Self Employed Postgraduate No No
9 Self Employed Undergraduate No Yes
10 Student/Postdoc. Undergraduate Yes No
11 Self Employed Undergraduate Yes Yes
I am only going to work with a small subset of the full magazine dataset,
so just 17 lines. See Figure 9.2. Out of these 17 members there are ten who
Figure 9.3: Splitting on Employment Status. The subscriber/non-subscriber counts in the branches are 2/4 (Student/Postdoc.), 3/3 (Self Employed) and 5/0 (Employed), the last being pure.
We read this as follows. In the boxes are two numbers. The number
on the left is the number of people who are magazine subscribers and the
number on the right those who aren't. Is this useful?
Yes and no. If they say they are Employed then this is very useful because
five employed people say they are magazine subscribers and none say they
aren’t. That is very useful information because as soon as we know someone
is employed we know that they will be subscribers. And no more questions
need be asked.
If only it were true that all employed people subscribed to the magazine.
Sadly, this data has been massaged as I mentioned. In practice you would
have a lot more data, you'd be unlikely to get such a clear-cut answer, but
you’d have a lot more confidence in the results. However even with these
numbers we can start to see some ideas developing. When you get an answer
that gives perfect predictability like this it is called pure. That is the best
possible outcome for a response. The worst possible outcome is if you get
the answer Self Employed because there is no information contained in that
answer, it is 50:50 whether or not a self-employed person is a subscriber.
In fact we seem to have gone backwards, at least before we had asked any
questions we knew that ten out of 17 people were subscribers, now we are
at a coin toss.
We can move further down the tree. We can ask those two non-alumni
Students/Postdocs what their highest degree is. Unfortunately that does not
help here because both have the same level, they have postgraduate degrees.
There's nothing we can do with their answers to the three questions that
will separate these two individuals.
Some observations...
• Ideally you will end with a leaf that is a pure set, which classifies
perfectly
• However if you have an impure set you might find that the remaining
questions do not help to classify
• You should always keep track of the numbers (in the boxes) because
that will give you a probabilistic classification and a degree of confi
dence
• Notice how the same questions can appear at different places in the
tree, although there will be no path along which you get asked the
same question twice
9.4 Entropy
In layman's terms you want to find the attribute that gives you the highest
information gain. I reckon that out of Employment Status, Highest Degree
Level and whether or not a person is a CQF Alumnus the attribute that does
the splitting the best, that gives the highest information gain, is Employment
Status. But what makes me think that?
We need some way of looking at which attribute is best for splitting the
data based on the numbers in each class. Referring to Figure 9.3 we need
some numerical measure that gives us how efficient the split is based on
the numbers 2 / 4, 3 / 3 and 5/0. But first we need to see what the
Employment Status attribute is competing with. In Figure 9.7 we see the
splits you get from having a root that is either CQF Alumnus or Highest
Degree Level.
It looks like both of these splits do a very poor job. Both of them have
one branch that is a coin toss. And in both cases the other branch is not
much better. But we need to quantify this.
The uncertainty in classification is going to be measured by the entropy.
We saw this, and the Justification for it, in Chapter 2. All we need to do to
determine the best attribute to start with, the root of our tree, is to measure
the entropy for each attribute and choose the one with the lowest value.
Usually this is done by measuring the information gain.
To measure the gain we need a starting point to measure gain relative
to. And that would be the raw data before it is split. We calculate the
entropy for the original data, that is ten magazine subscribers and seven non
subscribers:
$$-\sum p\log_2(p) = -\frac{10}{10+7}\log_2\left(\frac{10}{10+7}\right) - \frac{7}{10+7}\log_2\left(\frac{7}{10+7}\right) = 0.977.$$
For the Self Employed branch of Employment Status we have a rather obvious
1, since there is an equal number of subscribers and non. And for the
Employed branch the entropy is zero since the split is pure.
Now we measure the average entropy for Employment Status as

$$\frac{6}{17}\times 0.918 + \frac{6}{17}\times 1 + \frac{5}{17}\times 0 = 0.677.$$

(The 0.918 is the entropy of the Student/Postdoc. branch with its 2/4 split.)
That's because six of the 17 are Student/Postdoc. , another six of the 17 are
Self Employed and five are Employed.
Thus the information gain thanks to this split is 0.977 - 0.677 = 0.300.
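A small Python sketch of these calculations (my own illustration, using the counts from Figure 9.3):

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    # Weighted average entropy of the branches, subtracted from the parent's entropy
    n = sum(parent_counts)
    avg = sum(sum(b) / n * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - avg

parent = [10, 7]                       # 10 subscribers, 7 non-subscribers
branches = [[2, 4], [3, 3], [5, 0]]    # Student/Postdoc., Self Employed, Employed
print(entropy(parent))                 # 0.977
print(information_gain(parent, branches))   # 0.300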
We do exactly the same for the Highest Degree Level and CQF Alumnus
attributes:
Take one of the (remaining) attributes and branches and split the
data.
$$-\sum_i \frac{N_i}{N}\log_2\left(\frac{N_i}{N}\right),$$

where the sum is over all the relevant branches, N is the number of samples in the parent node and $N_i$ is the number of samples at the end of each branch. So for the data in Figure 9.3 we would divide the information gain,
0.300, by

$$-\frac{6}{17}\log_2\left(\frac{6}{17}\right) - \frac{6}{17}\log_2\left(\frac{6}{17}\right) - \frac{5}{17}\log_2\left(\frac{5}{17}\right) = 1.58.$$
This gain ratio allows for different numbers of partitions.
There are also different measures of uncertainty. A common one, besides entropy, is Gini impurity. Gini impurity measures how often something would be incorrectly labelled if that labelling was random. If we have probabilities $p_i$ for being in set i then the probability of mislabelling is $1 - p_i$. The Gini impurity is the average of this over all elements in the set:

$$\sum_i p_i(1 - p_i) = 1 - \sum_i p_i^2.$$
9.5 Overfitting And Stopping Rules

If you have many attributes it is possible that by the time you get down
to the leaf level there are very few records. The danger here is that your
results are not representative of other, non-training, data. In other words,
you have overfitted and all that your decision tree is doing is memorizing
the data it has been given. To reduce the chance of this happening it is
common to introduce stopping rules that say that when you get down to a
certain number of records (either an absolute number or as a fraction of the
whole dataset) you stop splitting the data and create a leaf. This way your
results can remain statistically significant. This just means that you end up
with a reliable probabilistic classification rather than an unreliable, possibly
deterministic, one.
This is also the classic problem of fitting training data well, but doing
badly on test data.
A stopping rule can also help when you have a new data point that you
want to classify but there was no data point in the training set with the same
features.
9.6 Pruning
What can we do if the answers to our questions are not categories but
numerical? Suppose that we have data for people’s heights, and whether or
not they are magazine subscribers, such as that shown in Figure 9.8. The
height data here is entirely made up but to make things interesting I have
naturally assumed that magazine subscribers tend to be taller, and more
attractive.
ID  Employment Status  Degree Level  CQF Alumnus  Height  Wilmott Magazine Subscriber
1  Self Employed  Postgraduate  No  174.6  No
Figure 9.8: Same data as before but with added (fictitious) height information.
9.8 Regression
Let’s look at the data shown in Figure 9.9. Here we have recent auction
prices for several Peugeot Partner Tepees (no, me neither, but I think they
are cars or vans) together with various features, some are categories such
as transmission type and some are numerical, mileage for example. I have
tweaked this data only a tiny amount and simply to illustrate several things
that can happen in building the tree.
The dependent variable will be $y^{(n)}$, representing the value of the $n^{th}$ car, say. For each attribute we are going to look at the sum of squared error across the branches. But, crucially, each branch will have a different model for the forecast.

To see what I mean let's start with perhaps the simplest attribute, whether the car is manual or automatic transmission.
Transmission

Manual: £3,150, £5,300, £5,250, £5,700, £5,050, £5,000, £3,650, £4,950, £6,400, £4,900, £5,950, £5,250, £2,600, £6,250
Automatic: £6,400, £4,850, £7,150, £5,900
In Figure 9.10 we see how the car prices depend on the type of trans
mission. If the car is manual then the average price is £4,957, and £6,075
if automatic. If we had no other information (such as mileage, age, etc.)
so that this was the end of the branches then those would be our forecasts
for any other Peugeot Partner Tepee we might come across. I.e. the model
we use is the average at each leaf, but it's a different average for each leaf. We'll see something a bit more sophisticated than the average shortly.

But how good is that attribute at predicting compared with say predicting price based on number of seats or mileage? We need to know this if we are to build our tree most efficiently. To figure that out we need something to replace entropy. And the easiest quantity to use is the sum of squared errors:
errors:
E (s“
Manual
anual V
E(
Auto
ijAuto - y
)
And now this is interpreted as follows. The threshold for the mileage, say, is s. And we are looking at prices for cars under and over the mileage of s. $x_m^{(n)}$ means the numerical value of the $m^{th}$ attribute (mileage) of the $n^{th}$ data point (car). The minimization over y just means each branch has its own model i.e. value. Rather obviously the minimization over y will result in the value for each y being the average of values on that branch. And then finally we want to find the threshold that gives us the best split of the data. Let's look at the example.
Mileage

Under 40,000 miles: £6,250, £5,950, £6,400, £5,250, £3,150, £5,900, £5,050, £4,900, £7,150, £5,700, £5,300, £4,950
Over 40,000 miles: £5,000, £3,650, £6,400, £5,250, £2,600, £4,850
In Figure 9.11 we see the data divided according to the mileage attribute with the threshold of 40,000. The mean value (the y) for under 40,000 miles is £5,496 and £4,625 for above. The sum of squared errors, as in expression (9.1), is 19,771,041. This is worse than using the transmission type for regression. But that's because I just picked 40,000 at random. We can minimize this error by choosing a threshold of 60,000 when the squared error becomes 17,872,343. This is now better than using transmission but not by much.
Anyway, we can continue this process by looking for the best attribute
to be the root of our decision tree. And then work our way down the tree
looking at the remaining attributes exactly as in the classification problem.
To summarize: Choose an attribute, m, and for that attribute find a
threshold s to minimize (9.1). The m gives us which feature to choose and
the s tells us where the split is.
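A short Python sketch of that search (my own illustration): for a numerical feature, try each candidate threshold, use the branch means as the models, and keep the threshold with the smallest sum of squared errors.

import numpy as np

def best_split(x, y):
    # x: numerical feature (e.g. mileage), y: value (e.g. auction price)
    best_s, best_err = None, np.inf
    for s in np.unique(x)[:-1]:                    # candidate thresholds
        below, above = y[x <= s], y[x > s]
        err = ((below - below.mean()) ** 2).sum() + ((above - above.mean()) ** 2).sum()
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

# Toy usage
mileage = np.array([12000, 25000, 38000, 52000, 61000, 80000, 95000])
price = np.array([6400.0, 5900.0, 5500.0, 5200.0, 4800.0, 3600.0, 2600.0])
print(best_split(mileage, price))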
I don't know whether this will help or confuse but there are three minimizations of the standard deviation happening here:

1. There is the minimization within each node, finding the best model. Here this is trivially the mean of the data in each node

3. Then we choose which feature gives us the lowest error and this becomes the one we use to move further down the tree
Linear regression
A simple splitting of the data into two branches using a threshold of 6.2
years gives the lowest sum of squared errors at 15,616,176. (This happens
to be our best error yet, so really ought to be the root.) But we are lumping
together a lot of data points and throwing out any nuance. You’ l l see that
nuance now when I plot value against age.
What if instead of splitting the data at a threshold we fit a straight line
to the data? We’d get the results in Figure 9.13 and an error of 12,457,500.
And finally, in the spirit of both splitting and linear regression, splitting at 5.2 years and fitting two straight lines, a different one above and below the split, gives the results shown in Figure 9.14, with an error of just 8,609,273. Of course this is no more than a more complicated fitting function than a straight line.
You should see what I mean now: by fitting with a straight line or lines
I've kept more information from the data and got a better forecast.
You can get more complicated by making decisions based on (linear perhaps)
combinations of features. (Age minus some parameter times the number
of seats.)
Decision trees can have a tendency to overfit. Think of the rituals that a
footballer might go through before a match based on experiences from the
past when he won or lost. Lucky socks? Check. Leave dressing room left
foot first? Check. No curry the night before? Check. To reduce overfitting
and variance in predictions one can use various methods for aggregating
results over many trees.
Bootstrap aggregation, or bagging for short, is a method in which one
samples, with replacement, from the full data set, applies whatever
machine-learning technique one is using to get a forecast, and then repeats
with another random sample. One does this many times and the final forecast
will be either the majority vote in a classification problem or an average for
a regression problem.
Random forests is the same idea with one twist. In deciding which node
to use for a split you don't look for the best feature for splitting over all
possible features. Instead you consider a random subset of features and
choose the best feature from that subset.
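As a sketch of these two ideas, here is one way to bag regression trees in Python using scikit-learn's DecisionTreeRegressor; X, y and X_new are placeholder numpy arrays for your training features, training values and new points. For a classification problem you would take a majority vote instead of the mean, and a random forest would additionally restrict each split to a random subset of the features (scikit-learn's RandomForestRegressor does exactly that).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_forecast(X, y, X_new, n_trees=100, seed=1):
    # Bootstrap aggregation: fit one tree per bootstrap sample (rows drawn
    # with replacement) and average the forecasts over all the trees.
    rng = np.random.default_rng(seed)
    n = len(y)
    forecasts = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # sample with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        forecasts.append(tree.predict(X_new))
    return np.mean(forecasts, axis=0)                 # the final, averaged, forecast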
Further Reading
Neural Networks
You will have seen pictures like Figure 10.1, representing a typical neural
network. This is a good illustration of the structure of a very simple
feedforward network. Inputs go in the left of this picture and outputs come
out of the right. In between, at each of the nodes, there will be various
mathematical manipulations of the data. It’s called feedforward because the
data and calculations go in one direction, left to right. There is no feedback
or recurrence like we'll be seeing later.
Figure 10.1: A typical neural network. Here there is an input layer, an output
layer and one hidden layer in between.
The Universal Approximation Theorem says that, under quite weak conditions,
as long as you have enough nodes then you can approximate any
continuous function with a single-layer neural network to any degree of accuracy.
The neural-network architecture for this simple task is shown in
Figure 10.2. There is a single input, think x, and a single output, think y.
To get a better approximation you will just need more nodes in the hidden
layer.
This is one of the most important uses for a neural network, function
approximation. But usually our problems are not as straightforward as this
implies. Instead of having a single independent variable, the x, we will
typically have many, a whole array. And instead of a single output, the y,
there could be many outputs. And most importantly instead of having a
known function that we want to approximate we have a whole load of data,
the inputs and the outputs, and we want our network to figure out what
happens at the nodes to give the best approximation, one that can be used
on new, unseen, data. With such complicated problems we'll need the richer
structure that you can get by having more than one hidden layer.
Ok, now it's time to tell you what goes on in the hidden layer, what are
the mathematical manipulations I've mentioned.
Figure 10.3 shows just about the simplest network, and one that can still
be used to describe the manipulations going on.
[Figure content: inputs x_1 and x_2 enter nodes a_1 and a_2, with weights w_1 and w_2 and a bias b;
a_1 = x_1,  a_2 = x_2;  z = w_1 a_1 + w_2 a_2 + b;  \hat{y} = g(z).]
Figure 10.3: The manipulations in the classical neural network.
Here we have two inputs on the left, x_1 and x_2, and a forecast/output
on the right, \hat{y}.
The inputs are passed unchanged into the first nodes:
a_1 = x_1,   a_2 = x_2.
Then each of these node values is multiplied by a weight, the ws. And
then a bias, b, is added:
z = w_1 a_1 + w_2 a_2 + b,
\hat{y} = g(z).
This is a very, very simple transformation of the data, here used to give
some function of a two-dimensional input.
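Translated literally into Python (the numbers for the inputs, weights and bias are arbitrary, and I have used the logistic function for g):

import math

def g(z):                       # the activation function; here the logistic function
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 0.5, -1.2              # the two inputs
w1, w2, b = 0.3, 0.8, 0.1       # weights and bias, normally found by training

a1, a2 = x1, x2                 # inputs passed unchanged into the first nodes
z = w1 * a1 + w2 * a2 + b       # linear combination plus bias
y_hat = g(z)                    # the forecast/output
print(y_hat)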
Now let's do that with a larger network, showing all the sub- and superscripts
we'll be needing.
In Figure 10.4 is a neural network with a single hidden layer. For more
layers the following is no different. Just keep an eye out for the superscript
denoting which layer we are looking at.
Figure 10.4: The quantities being input, output and in the nodes.
Notation
I show three types of quantities in the figure. The inputs are the xs. The
outputs are the \hat{y}s. The values in each of the nodes are the as. Let's look
at the sub- and superscripts, etc., and what they mean.
As before the superscript (n) refers to the data point. The subscript
m on x is the position in the vector of features, of which there will be M.
So our inputs will be vectors x^{(n)} = (x_1^{(n)}, \ldots, x_M^{(n)}).
These quantities are then input into the first node. The input layer of
nodes contains values a_m^{(1)}, as shown by the direction of the arrows. We put
the input values, the xs, unchanged into the first layer:
a_m^{(1)} = x_m.
The nodes in the next layers have the values a_j^{(l)}. Here l represents
the number of the layer, out of L layers, and the j represents the specific
node. Note that the number of nodes in each hidden layer can be anything.
However the input layer has the same number of nodes as features, and the
output layer the number of nodes needed for our forecast.
We continue like this until the output. The \hat{y}^{(n)} means the output associated
with the n-th data point. That will be another vector, with usually
a different dimension from that of the input vector. And the hat in this just
means that this is the forecast value, as opposed to the actual value for the
dependent quantity from the training data.
As I showed above, two things happen to the numerical quantities as we
go through the network.
Propagation
In going from one layer to the next the first thing that happens is a linear
combination of the values in the nodes. Figure 10.5 is similar to Figure 10.3
except with more nodes and sub- and superscripts everywhere.
[Figure 10.5: the nodes in two adjacent layers, l − 1 and l.]
I have labelled the two layers as l − 1 and l. Also notice that the general
node in the left-hand layer is labelled j and one in the right-hand layer, layer
l, is labelled j'.
We want to calculate what value goes into the j'-th node of the l-th layer.
First multiply the value in the j-th node of the previous, (l − 1)-th,
layer by the parameter w_{j,j'}^{(l)}, and then add another parameter b_{j'}^{(l)}. Then we
add up all of these for every node in layer l − 1.
This is just
\sum_{j=1}^{J_{l-1}} w_{j,j'}^{(l)} a_j^{(l-1)} + b_{j'}^{(l)}    (10.1)
where J_{l-1} means the number of nodes in layer l − 1. I'll call this expression z_{j'}^{(l)}.
This is a bit hard to read, such tiny fonts, and anyway it's much easier
to write and understand as
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}    (10.2)
with the matrix W^{(l)} containing all the multiplicative parameters, i.e. the
weights w_{j,j'}^{(l)}, and b^{(l)} is the bias. The bias is just the constant in the linear
transformation. (Sometimes the bias is represented by an extra node at the
top of the layer, containing value 1, with lines coming down to the next
layer. This way of drawing the structure is exactly equivalent to what we
have here.)
This Just means that the first manipulation, to get to the z, is to take a
linear combination of values in the previous layer. But remember, we don't
specify what the weights and biases are, they are to be found during the
training of the network.
If that were it then it wouldn’t be interesting. But we still have to perform
the other simple transformation.
The activation function gets its name from a similar process in physical,
i.e. inside the brain, neurons whereby an electrical signal once it reaches a
certain level will fire the neuron so that the signal is passed on. If the signal
is too small then the neuron does not fire.
Applying the same idea here we simply apply a function to expressions
(10.1) or (10.2). Let's call that function g^{(l)}. It will be the same for all
nodes in a layer but can differ from layer to layer. And we do specify this
function.
Thus we end up with the following expression for the values in the next
layer
a^{(l)} = g^{(l)}\!\left(z^{(l)}\right) = g^{(l)}\!\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)    (10.3)
The function of a vector just means taking the function of each entry.
So a signal is passed from all the nodes in one layer to the next layer
where it goes through a function that determines how much of the signal to
pass onwards.
And so on, all the way through the network, up to and including the
output layer which also has an associated activation function.
This can be interpreted as a regression on top of a regression on top of
a . ..
You can see that what comes out of the right is just a messy function of
what goes in the left. I use messy in the technical mathematical sense that
if you were to write the scalar ys explicitly in terms of the scalar xs then it
wouldn't look very pretty. It would be a very long expression, with lots of
sub- and superscripts, and summations. But a function it most definitely is.
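Expressions (10.2) and (10.3) translate almost line for line into code. Here is a minimal sketch of the forward pass, assuming a list of weight matrices and bias vectors, one pair per layer, and the logistic activation function everywhere; the particular shapes below are arbitrary.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs, g=logistic):
    # Feed the input vector x through the network:
    # a^(l) = g(W^(l) a^(l-1) + b^(l)), expressions (10.2) and (10.3).
    a = x
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = g(z)
    return a                    # the forecast, y-hat

# A made-up network: 3 inputs, a hidden layer of 4 nodes, 2 outputs.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
print(forward(np.array([0.1, -0.5, 2.0]), Ws, bs))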
Classification problems
I've mentioned this before, but it's worth repeating: classification is different
from regression and function fitting.
Suppose you want to classify fruit. You have peaches, pears, plums,
etc. Your raw data for the xs might be numerical quantities representing
dimensions, shape, colour, etc. But what will the dependent variable(s) be?
You could have a single dependent y which takes values 1 (for peach), 2 (for
pear), etc. But that wouldn't make any sense when you come to predicting a
new out-of-sample fruit. Suppose your output prediction was y = 1.5. What
would that mean? Half way between a peach and a pear perhaps? That’s
fine if there is some logical ordering of the fruit so that in some sense an
peach is less than a pear, which is less than a plum, and so on. But that's
not the case.
It makes more sense to output a vector y with as many entries as there
are fruit. The input data would have a peach as being (1,0,\ldots,0)^T, a pear
as (0,1,\ldots,0)^T and so on. An output of (0.4,0.2,\ldots,0.1)^T would then be
useful, most likely a peach but with some pear-like features.
Linear function
g(x) = x.
The step function behaves like the biological activation function described
above.
g(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}
The signal either gets through as a fixed quantity, or it dies. This might
be a little bit too extreme, leaving no room for a probabilistic interpretation
of the signal for example. It also suffers from numerical issues to do with
having zero gradient everywhere except at a point, where it is infinite. This
messes up gradient descent again. Probably best to avoid.
g(x) = \max(0, x).
ReLU stands for Rectified Linear Units. It’s one of the most commonly used
activation functions, being sufficiently nonlinear to be interesting when there
are many interacting nodes. The signal either passes through untouched or
dies completely. (Everyone outside machine learning calls it a ramp function.)
g(x) = \begin{cases} 0 & x < 0 \\ x & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}
g(x) = \frac{1}{1 + e^{-x}}
This is a gentler version of the step function. And it’s a function we have
found useful previously. It could be a good choice for an activation function
if you have a classification problem.
The tanh function can also be used, but this is just a linear transformation
of the logistic function.
Softmax function
g(x_j) = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}
It is often used in the final, output, layer of a neural network, especially with
classification problems.
One can be quite flexible in choosing activation functions for hidden layers
but more often than not the activation function in the final, output, layer
will be pretty much determined by your problem.
Our goal is to ultimately fit a function. But I haven't yet said much
about the function we are fitting. Typically it won’t be the sine function
we’l l be seeing in a moment as our first example. It will come from data.
For each input independent variable/data point of features x^{(n)} there will
be a corresponding dependent vector y^{(n)}. This is our training data.
Our neural network on the other hand takes the x^{(n)} as input, manipulates
it a bit, and throws out \hat{y}^{(n)}. Our goal is to make the y^{(n)} and \hat{y}^{(n)} as close as possible.
[Figure: approximating sin(2x) with a simple network.]
The notation is obvious, I hope: y_k^{(n)} is the dependent data for the n-th data
point, the k representing the k-th node in the output vector, and \hat{y}_k^{(n)} is similar
but for the forecast output.
(In the above sine wave example we only had one output so K = 1.)
Classification
But if we have three or more classes then we have to sum over all of the
K outputs, K being the number of classes:
J = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_k^{(n)} \ln\!\left(\hat{y}_k^{(n)}\right) + \left(1 - y_k^{(n)}\right) \ln\!\left(1 - \hat{y}_k^{(n)}\right) \right)    (10.5)
10.11 Backpropagation
To use gradient descent we will need the sensitivities of the cost function
to all of the weights and biases,
\frac{\partial J}{\partial w_{j,j'}^{(l)}} \quad \text{and} \quad \frac{\partial J}{\partial b_{j'}^{(l)}}.
If we can find those sensitivities then we can use a gradient descent method
for finding the minimum. But this is going to be much harder here than in
any other machine-learning technique we have encountered so far. This is
because of the way that those parameters are embedded within a function of a
function of a. .. To find the sensitivities requires differentiating that function
of a function of a...And that’s going to be messy, involving repeated use
of the chain rule for differentiation, unless we can find an elegant way of
presenting this. Fortunately we can.
And this is made relatively painless by introducing the quantity
\delta_{j'}^{(l)} = \frac{\partial J}{\partial z_{j'}^{(l)}}.
Remember what z is, it’s the linear transformation of the values in the
previous layer, but before going into the activation function.
This idea is called backpropagation. Backpropagation is rather like calculating
the error between the y and the \hat{y} in the output layer and assigning
that error to the hidden layers. So the error in effect propagates back through
the network. This is slightly strange because although there is an obvious
meaning for the error in the output layer (it’s just the difference between
the actual value of y and the forecast value \hat{y}) there is no correct a in the
hidden layer with which to make a comparison. But that doesn’t matter.
Backpropagation is going to tell us all we need in order to find our parameters.
I'm sorry but we're going to have another one of those mini networks.
See Figure 10.7. This shows pretty much all we need in the following. Our
first goal is to find the sensitivity of the cost function to z_{j'}^{(l)}, i.e. the δs.
Figure 10.7: The network showing the key formulas.
And so
\delta_j^{(l-1)} = \frac{\partial J}{\partial z_j^{(l-1)}} = \left( \sum_{j'} \delta_{j'}^{(l)} w_{j,j'}^{(l)} \right) \frac{dg^{(l-1)}}{dz}\!\left(z_j^{(l-1)}\right)    (10.6)
Let's see what we have achieved so far, and what we haven't. Equation
(10.6) shows us how to find the δs in a layer if we know the δs in all layers
to the right. Fantastic.
But what is so great about these δs? That's actually the easy bit:
\frac{\partial J}{\partial w_{j,j'}^{(l)}} = \frac{\partial J}{\partial z_{j'}^{(l)}} \frac{\partial z_{j'}^{(l)}}{\partial w_{j,j'}^{(l)}} = \delta_{j'}^{(l)} a_j^{(l-1)}.    (10.7)
And that's done it! The sensitivity of the cost function, J, to the ws
can be written in terms of the δs which in turn are backpropagated from the
network layers that are just to the right, one nearer the output.
And the sensitivity of the cost function to the bias, b? That's even easier:
\frac{\partial J}{\partial b_{j'}^{(l)}} = \delta_{j'}^{(l)}.
In the above we have the derivative of the activation function with respect
to z. This will be a simple function of z depending on which activation
function we use. If it is ReLU then the derivative is either zero or one. If we
use the logistic function then we find that g'(z) = g(1 − g), rather nicely.
The last hidden layer is different because it feeds into the output. If the
cost function is quadratic, for example, then we have instead
\delta_j^{(L)} = \frac{dg^{(L)}}{dz}\!\left(z_j^{(L)}\right) \left( \hat{y}_j - y_j \right).
(To avoid confusion, if there is a single output then you can drop the j
subscripts.)
You don't actually need to know the actual value of the cost function
to find the weights and biases, but you will want to calculate it so you
can see how well the training is progressing.
At each step of the training we calculate the δs, starting at the output layer and working backwards,
\delta_j^{(L)} = \frac{dg^{(L)}}{dz}\!\left(z_j^{(L)}\right) \left( \hat{y}_j - y_j \right), \qquad \delta_j^{(l-1)} = \frac{dg^{(l-1)}}{dz}\!\left(z_j^{(l-1)}\right) \sum_{j'} \delta_{j'}^{(l)} w_{j,j'}^{(l)},
and then update the weights and biases:
\text{New } w_{j,j'}^{(l)} = \text{Old } w_{j,j'}^{(l)} - \beta \frac{\partial J}{\partial w_{j,j'}^{(l)}} = \text{Old } w_{j,j'}^{(l)} - \beta\, \delta_{j'}^{(l)} a_j^{(l-1)}
and
\text{New } b_{j'}^{(l)} = \text{Old } b_{j'}^{(l)} - \beta \frac{\partial J}{\partial b_{j'}^{(l)}} = \text{Old } b_{j'}^{(l)} - \beta\, \delta_{j'}^{(l)}.
Return to Step 1.
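Here is a minimal sketch of one training step, pulling together the forward pass, equations (10.6) and (10.7), and the update rules, for a network with logistic activations and a quadratic cost on a single data point. The learning rate beta and the shapes of the arrays are arbitrary choices, and in practice you would loop this over the training data (stochastic gradient descent).

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, Ws, bs, beta=0.1):
    # Forward pass, remembering the activations of every layer.
    a, As = x, [x]
    for W, b in zip(Ws, bs):
        a = logistic(W @ a + b)
        As.append(a)

    # Output-layer delta for a quadratic cost: g'(z)(y_hat - y), with g' = g(1 - g).
    delta = As[-1] * (1 - As[-1]) * (As[-1] - y)

    # Work backwards through the layers.
    for l in reversed(range(len(Ws))):
        grad_W = np.outer(delta, As[l])    # dJ/dw = delta times a from the layer before, (10.7)
        grad_b = delta                     # dJ/db = delta
        if l > 0:
            # Backpropagate the delta to the previous layer, equation (10.6).
            delta = (Ws[l].T @ delta) * As[l] * (1 - As[l])
        Ws[l] -= beta * grad_W             # New w = Old w - beta dJ/dw
        bs[l] -= beta * grad_b             # New b = Old b - beta dJ/db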
[ipython console output: the 784 greyscale pixel values of a single MNIST digit, mostly zeros with the strokes of the digit showing up as values up to 255, followed by a plot of the same digit as a 28 x 28 image.]
The network that I use has 784 inputs, just one hidden layer with a
sigmoidal activation function and 50 nodes, and 10 outputs. Why 10 outputs
and not just the one? After all we are trying to predict a number. The reason
is obvious: as a drawing it is not possible to say that, say, a 7 is midway
between a 6 and an 8. So this is a classification problem rather than a
regression.
For the first time in this book I'm actually going to do both the training
and the testing.
The MNIST training set consists of 60,000 handwritten digits. The network
is trained on this data by running each data point through the network
and updating the weights and biases using the backpropagation algorithm
and stochastic gradient descent. If we run all the data through once that is
called an epoch. It will give us values for the weights and biases. Although
the network has been trained on all of the data the stochastic gradient
descent method will have only seen each data point once. And because of the
usually small learning rate the network will not have had time to converge.
So what we do is give the network another look at the data. That would
then be two epochs. Gradually the weights and biases move in the right
direction, the network is learning. And so to three, four, and more epochs.
We find that the error decreases as the number of epochs increases. It
will typically reach some limit beyond which there is no more improvement.
This convergence won’t be monotonic because there will be randomness in
the ordering of the samples in the stochastic gradient descent.
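If you want to reproduce an experiment like this without writing your own backpropagation, scikit-learn's off-the-shelf network will do. A sketch, assuming you have already loaded the MNIST images into arrays X_train, y_train, X_test and y_test (those names, and the exact parameter values, are just placeholders): 784 inputs, one hidden layer of 50 logistic nodes, 10 outputs, trained by stochastic gradient descent for a number of epochs.

from sklearn.neural_network import MLPClassifier

# X_train: shape (60000, 784), pixel values scaled to [0, 1]; y_train: digit labels 0-9.
net = MLPClassifier(hidden_layer_sizes=(50,),
                    activation='logistic',
                    solver='sgd',
                    learning_rate_init=0.1,
                    max_iter=20,        # roughly the number of epochs
                    verbose=True)
net.fit(X_train, y_train)

print("training error:", 1 - net.score(X_train, y_train))
print("test error:", 1 - net.score(X_test, y_test))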
In Figure 10.10 is a plot of the error, measured by the fraction of digits
that are misclassified, against the number of epochs.
[Figure 10.10: misclassification error (vertical axis, roughly 0.01 to 0.07) against number of epochs (up to 20), training set.]
Let’s look at how well the trained network copes with the test data. In
Figure 10.11 we see the error versus epoch for the test data as well as for
the training data.
Clearly the network is not doing so well on the test data, with only a
95% accuracy. That's almost double the error on the training data. But it's
still pretty good for such a simple network.
Figure 10.11: Error versus number of epochs, training and testing sets.
The time taken to train the network depends on its architecture. For example suppose one has 784 inputs, 50 nodes in one
hidden layer and ten outputs then the time taken will be proportional to
784 X 50 + 50 X 10 = 39,700.
But if we had, say, 784 inputs, one hidden layer of 30, another of 20, and
ten outputs then the time would be proportional to
784 X 30 + 30 X 20 + 20 X 10 = 24,320.
The same number of nodes, but only two thirds of the time.
(This is assuming that different activation functions all take roughly the
same time to compute.)
My digits
I showed a few of my handwritten digits to the trained network. In Figure
10.12(a) is my number 3 together with the digitised version. It got this one
right. Figure 10.12(b) is my seven, it got this one wrong. Perhaps because
it's European style, with the bar.
Nonsense
Autoencoder
The autoencoder is a very clever idea that has outputs the same as the
inputs. Whaaat? The idea is to pass the inputs through a hidden layer (or
more than one) having significantly lower dimension than the input layer and
then output a good approximation of the inputs. It’s like a neural network
version of Principal Components Analysis in which the original dimension of
the data is reduced (to the number of nodes in the hidden layer).
Training is done exactly as described above, just that now the ys are the
same as the xs.
You can see the idea illustrated in Figure 10.14.
[Figure 10.14: an autoencoder. A ten-dimensional sample enters on the left, passes through a three-node bottleneck (values 0.84, 0.11, -1.74) and comes out on the right almost, but not exactly, unchanged.]
The data goes in the left-hand side (the figure just shows one of the samples).
It goes through the network, passing through a bottleneck, and then
comes out the other side virtually unchanged. You'll see in the figure that
the output data is slightly different from the input since some information
has been lost. But we have reduced the dimension from ten to three in this
example.
Now if you want to compactly represent a sample you Just run it through
the autoencoder to the bottleneck and keep the numbers in the bottleneck
layer (here the 0.84, 0.11, -1.74). Or if you want to generate samples then
just put some numbers into the network at the bottleneck layer and run to
the output layer.
There doesn't have to be Just the one hidden layer. As always there can
be any number. But the lowest dimension is the dimension of the smallest
layer.
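A quick way to experiment with the idea is to train a network whose targets are its own inputs. A sketch using scikit-learn's MLPRegressor with a three-node bottleneck; the data here is made up (ten-dimensional samples that really only have three underlying dimensions, plus a little noise) so that the compression has something to find.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
latent = rng.standard_normal((500, 3))                  # three hidden factors
X = latent @ rng.standard_normal((3, 10)) + 0.05 * rng.standard_normal((500, 10))

# One hidden layer of three nodes, the bottleneck; the targets are the inputs.
auto = MLPRegressor(hidden_layer_sizes=(3,), activation='tanh',
                    max_iter=5000).fit(X, X)

X_out = auto.predict(X)   # slightly different from X because information has been lost
print("mean squared reconstruction error:", np.mean((X - X_out) ** 2))

To read off the compact representation of a sample you would run it through the first layer only, using the fitted weights and biases stored in auto.coefs_[0] and auto.intercepts_[0].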
Radial basis functions can also be used in the hidden layer: each node outputs something like
w_i\, e^{-b\,|x - c_i|^2}
instead of, say, a sigmoidal function. And then the output layer would add
up all of these, perhaps with another activation function transformation as
well.
You can probably see overlaps with self-organizing maps, support vector
machines and K nearest neighbours.
You will often hear the phrase Deep Learning. Different people mean
different things when they use the phrase, there doesn't seem to be a generally
recognised definition. But what all the different interpretations have
in common is that neural networks are involved and that they have a large
number of neurons and hidden layers.
As the amount of data that we have access to explodes and the speed
of computation increases so it makes sense to explore the possibilities of
networks that are deeper and deeper. Why stop at L layers? When you get
more data you might as well look at the performance of L + 1 layers and
more sophisticated architecture.
Further Reading
Reinforcement Learning
11.4 Jargon
• Action: What are the decisions you can make? Which one-armed
bandit do you choose in a casino? Where do you put your cross in the
game of Noughts and Crosses? How many, and which, cards do you
exchange in a game of draw poker? Those are al l example of actions.
• Reward: What you get as a result of your actions. Often there might not
be any reward until the end of the game. At the end of the chess game
the winner gets the $1,000 prize.
But there aren't just rewards. To win the prize you have to be the first
to solve the jigsaw puzzle. Every second you take can be thought of
as a punishment.
Blackjack is only an MDP if your state includes knowledge of all the cards remaining
in the deck (or decks, plural; in casino Blackjack the dealer will start with
five or more decks shuffled together). Or equivalently knowing what cards
have been dealt out already. (Well, not their suits, and 10, Jack, Queen and
King all count as 10s, but that's still a lot of info you need to know.) So
Blackjack is an MDP if you are able to memorize all of this information.
But that's unrealistic for a mere mortal.. .and you aren't allowed to use a
computer in a casino. See Rain Man. At the end of this chapter I’ll briefly
show you how you can approximate the state sufficiently to win at Blackjack.
An exception to this is when there is an infinite number of decks being
dealt from, in an online game. In that case Blackjack is an MDP if the
state is represented by your cards and the dealer’s upcard. That’s because
the probabilities for the next cards to be dealt never changes. But then you
can't win if you play with an infinite deck. Confused? You will be!
Markov refers to there being no memory if your state includes enough
information. So that is part of the trick of learning how to play any game.
Keep track of as many variables as needed, but no more.
You can get a lot more sophisticated with different types of MDP. For example
in a Partially Observable Markov Decision Process (POMDP) you only
get to see some of the current state. And so you might have a probabilistic
model for what you can't see.
The state in O&Xs is represented by the positions of the Os and Xs. The
order in which they were played, i.e. how you got to that state, is irrelevant.
Starting with an empty grid and it is us, with Xs, to go first we could sketch
something like Figure 11.1 showing the three possible moves. There are only
three despite the nine cells because of symmetries.
This structure continues for all possible plays in the game. From each
state branch out other states for the next play. Some finish when one player
gets three in a row. And some end up with all cells filled and no winner. See
Figure 11.2.
[Figure 11.2: part of the game tree for Noughts and Crosses. Branches from each state lead to the states for the next play; some end with three in a row for one player, others end with the grid full and no winner.]
In Figure 11.2 I have labelled moves from state S as L for Lose, D for
Draw and W for Win. For example at the very top right you will see that
our opponent, the Os, have got three in a row. That's an L for us.
Now we ought to make some assumptions about how good our opponent
is. That will help us make decisions about what actions to take.
And what if we don’t know how good our opponent is? No problem.
Because that's life! And that's what reinforcement learning is all about.
Learning on the job, while experiencing the environment rather than having
the environment programmed in.
In O&Xs each state would have a value, a value that percolates down
from later states in the game.
I'l l be talking about the value function in relation to reinforcement learn
ing in some detail and give some ideas about how to set it up.
So that's how you might address the game of O&Xs classically. Other
games and problems can be set up in the expanding tree structure as well.
Let's start to move out of the waffly, descriptive mode, and start at least
hinting at some mathematics. We shall do this by looking at an extremely
popular illustrative example.
• There will be ten bandits. The probabilities of winning for each bandit
are
Bandit 1 10%
Bandit 2 50%
Bandit 3 60%
Bandit 4 80%
Bandit 5 10%
Bandit 6 25%
Bandit 7 60%
Bandit 8 45%
Bandit 9 75%
Bandit 10 65%
• The value of each action, each bandit, will simply be the average
reward for that bandit. This average is only updated after each pull of
a lever. To be precise, it is only the chosen bandit that has its average
updated
• After each pull we have to choose the next bandit to try. This is done
by most of the time choosing the action (the bandit) that has the
highest value at that point in time. But every now and then, let's say
at a random 10% of the time, we simply choose an action (bandit) at
random with all bandits equally likely.
That last bullet point describes what is known as an ε-greedy policy. You
choose the best action so far, but you also do occasional exploration in case
you haven't yet found the best policy. The random policy happens a fraction,
ε, of the time.
I'd also like to emphasise that only you, the reader, and I know the probabilities
of winning for each bandit. We will not be telling the reinforcement
learning algorithm explicitly what these probabilities are. As I said above,
the algorithm only indirectly experiences them.
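Here is a sketch of the experiment in Python. The win probabilities are those in the table above and ε is the 10% just described; everything else, such as the random seed, is an arbitrary choice.

import numpy as np

probs = [0.10, 0.50, 0.60, 0.80, 0.10, 0.25, 0.60, 0.45, 0.75, 0.65]
rng = np.random.default_rng(0)

Q = np.zeros(10)        # the value (average reward so far) of each bandit
pulls = np.zeros(10)    # how many times each bandit has been chosen
epsilon = 0.1

for t in range(10_000):
    if pulls.sum() == 0 or rng.random() < epsilon:
        choice = int(rng.integers(10))        # explore: any bandit, equally likely
    else:
        choice = int(np.argmax(Q))            # exploit: the best bandit so far
    reward = 1.0 if rng.random() < probs[choice] else 0.0
    pulls[choice] += 1
    Q[choice] += (reward - Q[choice]) / pulls[choice]   # update the running average

print("estimated values:", np.round(Q, 2))
print("fraction of pulls:", np.round(pulls / pulls.sum(), 2))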
[Figure 11.3: the value Q of each of the ten bandits against the number of pulls, over the first 100 pulls. Bandits 3, 4 and 9 are labelled.]
In Figure 11.3 is shown the value function for each of the ten bandits as
a function of the total number of pulls so far. I have used Q to denote this
value, we'll see more of Q later. And in Figure 11.4 we see the fraction of
time each bandit has been chosen.
[Figure 11.4: fraction of time each bandit has been chosen against the number of pulls, over the first 100 pulls. Bandits 3, 4 and 9 are labelled.]
It's hard to see what's going on in black and white with so many curves,
but let me walk you through it. Bandit 3 was chosen at random for the first
pull. The first pull is always random, we have no data for the value yet, and
all Qs are set to zero. Bandit 3 loses. For the next pull the Qs are still all
zero and so each bandit is equally likely. Coincidentally Bandit 3 is chosen
again. This time Bandit 3 wins. Its Q value is now 0.5 and so is currently
the best bandit. This means that it will be chosen preferentially except when
the 10% random choice kicks in. Bandit 4 eventually gets picked thanks to
this random choice. It wins and thus now becomes the preferred choice. It
wins twice, then loses. And so it goes on. Up to 100 pulls it is looking like
Bandit 9 is the best. But then...
Figure 11.5: Fraction of time each bandit has been chosen, longer run.
In Figure 11.5 is shown the fraction of pulls for each bandit up to 10,000
pulls. Clearly Bandit 4 has become the best choice. And if you look at Table
11.1 you'll see that it does indeed have the highest probability of success.
The correct bandit will eventually be chosen; however, the evolution of the
Qs and the fraction of time each bandit is chosen will depend for a while on
which bandits are chosen at random. So in Figure 11.6 I show the results
of doing 100 runs of 10,000 pulls each. Having 100 runs has the effect of
averaging out the randomness in the experiments. Note the logarithmic scale
for the horizontal axis.
Figure 11.6: Fraction of time each bandit has been chosen (log scale) —
averaged over 100 runs.
We’ve done the preparatory work, seen basic ideas of how and where
reinforcement learning can be applied, now let’s start putting some proper
mathematical flesh on these bones.
In the next few sections we are going to do some of the mathematics
behind valuation for a given policy, policy evaluation, and then find the
optimal policy, the control problem. But initially we are going to restrict our
attention to known environments. Known environment in an MDP means
that for any state and action we have a known probabilistic description of
our next state and any reward.
Basic notation
States Let’s use S to denote the general state and s to mean a specific
state. I'll use s' to mean the next state (after we’ve taken some action).
And sometimes I'll give s a subscript when I want to emphasise the time or
step, s_t.
Actions I'll use A to mean the general action and a to denote a specific
action. Again the following action will be a' and sometimes the action will
have a subscript when I want to emphasise the time or step at which the
action is taken, a_t. Note that taking a certain action does not guarantee
what state we'll end up in. For example, in Blackjack taking another card
does not tell us what our next card count is. But we could have a simple
case, like in the maze that we'll see later, where the action of moving right
does tell us our next state.
Rewards Generally the reward can depend on any or all of action a, current
state s and next state s'. I'll use r to denote the actual reward. Sometimes
the r will have a subscript if I need to make it clear which reward I am
talking about. For example r_{t+1} is the reward that comes between state s_t
and state s_{t+1}.
Transition probabilities
P^a_{ss'} = \text{Prob}\left( s_{t+1} = s' \,|\, s_t = s,\, a_t = a \right).
For our maze below these probabilities would be either 1 or 0, because a
given action tells us exactly where we go next. But for Blackjack the action
of taking a card puts us in a random next state.
Goal or return The total reward, called the return or goal, that lies ahead
of us is
G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}.
The parameter γ ∈ [0,1] is a discount factor that says that the sooner the
reward the more we value it.
The discount factor If you come from a finance background then you
will know that use of discount factors, related to interest rates, is extremely
common. We talk of the time value of money and opportunity cost meaning
all things being equal it's better to get money upfront. A bird in the hand is
worth two in the bush. However the discount factor in reinforcement learning
can seem a bit arbitrary, although there are non-financial reasons for introducing one.
And the geometric time decay (the increasing powers of γ) is the most
common discounting mechanism since it has nice mathematical properties,
as we shall see.
[Figure: the maze. A grid of cells labelled A to C across and 1 to 3 down, with Start in cell A1 and Finish in cell C3; only cells A1, A3, B1, B2, B3 and C3 are open.]
We start in Cell Al, can only go right from there, and when we get to
Cell C3 we stop. Our state is represented by which cell we are in, B3 for
example.
In a Markov chain the probability of moving from one state to another is
given. But recall that in our notation above we had a potentially stochastic
action (the policy, π) and a potentially stochastic transition (the transition
probabilities, P^a_{ss'}). In the simple Markov chain here these two can be combined
into one transition probability matrix, and that's because given an
action (albeit random) the following state is deterministic: Move down from
B1 and you always end up in B2. (Another way of thinking of this is that
P^a_{ss'} only contains ones and zeroes.)
To represent the (combined) transition probabilities we would need a
6 × 6 matrix (there are six cells). See Figure 11.8. In this example I am going
to make a move from one cell to a neighbouring one equally likely: If there
is only one neighbour then the probability of moving to it is 100%, if two
neighbours then 50% each, etc. At this point there is nothing special about
Cells Al and C3, other than we can only go in one direction from Al and
we stop when we get to C3, there is no attempt to go from Start to Finish.
The transition probabilities tell us where to go with no goal in mind.
                          To
           A1    A3    B1    B2    B3    C3
From  A1    0     0     1     0     0     0
      A3    0     0     0     0     1     0
      B1   0.5    0     0    0.5    0     0
      B2    0     0    0.5    0    0.5    0
      B3    0    1/3    0    1/3    0    1/3
      C3    0     0     0     0     0     1
If the next move we make is given by the transition matrix in Figure 11.8
then we can write down a recursive relationship for the expected number of
steps it will take to get from any cell to the finish. Denote the expected
number of steps as a function of the state, i.e. the cell we are in, by v(s_t).
We have
v(A1) = 1 + v(B1)
because whatever the number of expected steps from B1, the expected number
from A1 is one more. Similarly
v(A3) = 1 + v(B3),
v(B1) = 1 + \frac{1}{2}\left( v(A1) + v(B2) \right).
And so on...
v(B2) = 1 + \frac{1}{2}\left( v(B1) + v(B3) \right),
v(B3) = 1 + \frac{1}{3}\left( v(A3) + v(B2) + v(C3) \right)
(from B3 there are three actions you can take) and finally
v(C3) = 0 + v(C3).
You never leave Cell C3.
These can be written compactly using vector notation. Let’s write v as
a vector v with each entry representing a state, Al, A3, ... , C3. Similarly
we'l l write the reward as a vector r.
In the above we have added 1 for every step in order to calculate the
expected number of steps in each state. But to make this more like a
reinforcement learning problem I am going to change the sign of the reward,
so that we are penalized for every step we take. This ties in with the later
idea of optimizing return. Maximizing the total reward when each reward is
negative amounts to minimizing the number of steps. The final answer for
V will be the negative of the expected number of steps for each state.
So the first five entries in this reward vector, r, will be minus one, and
the final entry will be zero. This just means that in going from state to state
we get a reward of —1, but since we can't leave cell C3 there is a zero for
that entry. And we shall write the transition matrix in Figure 11.8 as P.
Writing the above in vector form to find v all we have to do is solve
v = r + Pv.    (11.1)
This tells us the relationship between all the expected values, one for each
state. It is a version of the Bellman equation. We shall be seeing different
forms of the Bellman equation later.
This can be solved by putting the two terms in v onto one side and
inverting a matrix. However, since inverting matrices can be numerically
expensive, we can instead iterate: start with a guess for v, substitute it into
the right-hand side of (11.1) to get a new v, and repeat until the values settle down.
[Figure: successive iterations of the value for each cell of the maze, starting from zero everywhere and shown after k = 3 and k = 4 iterations.]
The final result, after many iterations, is that shown in Figure 11.10.
The numbers are easy to check.
[Figure 11.10: the final values for each cell of the maze; for example v(B2) = -14, v(A3) = -10, v(B3) = -9 and v(C3) = 0 at the Finish.]
That looks to me like it's taking an awfully long time to get from A1 to
C3 if you move randomly, 18 steps on average.
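Equation (11.1) is easy to check numerically. A sketch with the six cells ordered A1, A3, B1, B2, B3, C3 and the transition matrix of Figure 11.8; here I solve the linear system directly rather than iterating.

import numpy as np

# States in the order A1, A3, B1, B2, B3, C3; P is the matrix of Figure 11.8.
P = np.array([
    [0,   0,   1,   0,   0,   0  ],   # A1: can only move to B1
    [0,   0,   0,   0,   1,   0  ],   # A3: can only move to B3
    [0.5, 0,   0,   0.5, 0,   0  ],   # B1: A1 or B2
    [0,   0,   0.5, 0,   0.5, 0  ],   # B2: B1 or B3
    [0,   1/3, 0,   1/3, 0,   1/3],   # B3: A3, B2 or C3
    [0,   0,   0,   0,   0,   1  ],   # C3: you never leave the Finish
])
r = np.array([-1, -1, -1, -1, -1, 0])  # reward of -1 per step, 0 once finished

# v = r + P v, i.e. (I - P) v = r. The C3 row makes I - P singular, so fix
# v(C3) = 0 and solve for the remaining five states.
A = np.eye(6) - P
v = np.zeros(6)
v[:5] = np.linalg.solve(A[:5, :5], r[:5])
print(dict(zip(["A1", "A3", "B1", "B2", "B3", "C3"], np.round(v, 1))))
# v(A1) comes out at about -18: roughly 18 steps on average from Start to Finish.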
Of course, the big difference between Markov chains and MDPs is in the
potential for us to make (optimal) decisions about which actions to take.
Ultimately our goal in the maze is trying to get from start to finish asap. We
want to deduce, rather trivially in this maze, that, for example, the optimal
policy when in cell B3 is to move right. That will come later, after I’ve
introduced some more notation.
State-value function
Let's forget the maze example for a while and set out a fairly general
problem in which we have a given policy, given rewards and given probabilities
for changing from state to state. Everything can be stochastic and the
rewards and transition probabilities can be functions of ail of s, a and s'.
In the above I used v (or the vector v) to represent the expected (negative of the)
number of steps. More generally in an MDP we can use v to represent a
value for each state. Such a state-value function is defined as
v_\pi(s_t) = E_\pi\left[ G_t \,|\, S_t = s_t \right].
Action-value function
. . . defined as
Q_\pi(s_t, a_t) = E_\pi\left[ G_t \,|\, S_t = s_t,\, A_t = a_t \right].
This means the value when in state s_t of taking action A = a_t and afterwards
following the policy π. So that's the immediate reward plus the value that
comes after.
(And that’s the Q we saw in the section on the multi-armed bandit.)
This is the quantity that will later tell us what is the best action to take
next, the one that maximizes Q_π(s_t, a_t).
The relationship between the two value functions is
v_\pi(s_t) = \sum_a \pi(a|s_t)\, Q_\pi(s_t, a).
Figure 11.11: The relationship between the value functions for Cell B3 in
our maze.
You can already start to see the benefits of the action-value function Q
from the figure. Out of the three possible actions in state B3 it’s moving to
the right that has the highest Q value. Of course, that's assuming that the
state values in the adjacent cells are correct.
You should anticipate an iterative solution for the Q function, and thus
the fastest route through the maze. And that's sort of where we are heading.
Except that we will get there by going through the maze many times, using
trial and error. Enough of the maze.
We can write these two value functions recursively: The value function
now is just the expected value of the immediate reward plus the expectation
of the value function where you go to. Let me show you how this works.
But since there are potentially two probabilistic elements (the policy and
the transitions) we can make this more explicit using the notation from
Section 11.9:
v_\pi(s) = \sum_a \pi(a|s) \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma\, v_\pi(s') \right).
These final results are the Bellman equation, relating the current state-value
function to the state-value function at the next step.
Note that this recursive relationship between the value function now and
the value function at the next step is only possible because of the simple
geometric discounting of future rewards. That’s what I meant when I said
earlier that geometrical discounting has nice mathematical properties.
The same process also gives us the recursive equation for the action-value
function
Q_\pi(s, a) = \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma \sum_{a'} \pi(a'|s')\, Q_\pi(s', a') \right).
The whole point of setting up this problem is to find the optimal policy
at each state. So we are seeking to maximize v_π(s) over all policies π:
v^*(s) = \max_\pi v_\pi(s)
and
v^*(s) = \max_a Q^*(s, a).
And if we can find the next action then in principle the job is done because of
the recursive nature of the Bellman equation. This is the benefit of working
with Q, isolating the immediate next action.
\pi^*(a|s) = \begin{cases} 1 & \text{if } a = \text{argmax}_a\, Q^*(s,a) \\ 0 & \text{otherwise} \end{cases}
Once we have found the optimal policy then typically the π(a|s) becomes
deterministic, with probability of one for the optimal action and zero for other
actions.
For completeness, here is the action-value function Q* for our earlier
maze. That is, the action-value function for the optimal policy. Meaning
the value for taking an action and thereafter finding the optimal policy.
[Figure: the action-value function Q* for the maze, with one number in each cell for each available move (values between -1 and -5).]
The position of the numbers in the cells just refers to the move. So the
-4 at the top of Cell B2 means the value if you go up and thereafter do
what is optimal.
It's worth taking a moment out to look at the role of probability here. We
started out by talking about a policy being probabilistic, hence all those earlier
expectations. We've now moved on to talking about a deterministic policy.
That doesn't mean all probabilistic elements have necessarily disappeared.
• There could well be elements of the environment that are always random.
This is the case with the bandits and Blackjack. In both cases
there will be deterministic optimal policies (based on the state in Blackjack)
but the consequences still have random components. You may
be on a losing streak, but that shouldn’t change your policy. This is
where we are in our discussion of the theory so far.
• There might genuinely be random elements that are part of the optimal
policy. Rock, Paper, Scissors is the obvious example. However we
aren't going to look at such cases here.
Monte Carlo methods need us to have complete returns, which means that the MDP terminates: An episodic
MDP has a well-defined start and end, it doesn’t go on forever. You start
the game, play, and then finish. That's one episode, and then you start all
over again.
You couldn't easily calculate an expected return if the game never ter
minated.
We will use T to denote the step at which the game finishes. T can be
different for each episode, i.e. each time we play the game. So now
G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T.
We then estimate the value of each state as the expected return
for our chosen policy. But this expected value will (have to) be the empirical
mean return rather than some theoretical mean, since we aren't given the
probabilities.
In doing a simulation there are two important points that we have to
address:
• Can we be sure that we will visit each state enough times to be able
to get a decent estimate of the expectation? The standard deviation
of the estimated expectation for a state decreases in proportion to the
inverse of the square root of the number of episodes for which we've
been in that state.
Suppose, for example, that the episodes we observe look like
s_1 → s_2 → s_1 → s_3 → s_1 → terminates,
s_2 → s_4 → s_2 → s_1 → terminates,
s_1 → s_2 → terminates.
The mean value that we want to calculate is just the simple arithmetic
average of all values so far. That takes the form
v(s_t) = \frac{1}{n(s_t)} \sum_{i=1}^{n(s_t)} G_t^{(i)}    (11.5)
where I've used G_t^{(i)} to mean the return realised in the i-th episode in which
we have experienced state s_t, and n(s_t) is the total number of episodes in
which we have experienced state s_t so far. At the end of the next episode
in which we visit state s_t we'll get an updated expectation
v_{\text{new}}(s_t) = \frac{1}{n(s_t)+1} \sum_{i=1}^{n(s_t)+1} G_t^{(i)}
= v_{\text{old}}(s_t) + \beta \left( G_t^{(n(s_t)+1)} - v_{\text{old}}(s_t) \right),
where \beta = \frac{1}{n(s_t)+1}.
This is a simple updating that only requires remembering the current
value and the relevant number of episodes (in which we have reached that
state). But it can be simplified further.
What would it mean if instead of having (3 depending on the number
n(st) we instead made it a small parameter? This would tie in with the sort
of updating we've seen in other numerical techniques, gradient descent for
example. Would it be cheating? It would certainly make things a bit easier,
not having to keep track of n{st).
Well, it wouldn’t matter as long as the environment is not changing, and
that β did eventually decrease to zero. If the environment is stationary then
either way we would converge to the right average.
But if the environment is changing, i.e. is non stationary, then we should
use an average that takes this into account. And that's what the exponentially
weighted average does. Of course, then there is the matter of choosing
the parameter β that reflects the timescale over which the environment is
changing, the right half life. But that's a modelling issue.
The rule for updating thus becomes
v(s_t) \to v(s_t) + \beta \left( G_t - v(s_t) \right).
We get in our car, start the motor and a warning sign comes up on
the dashboard. In Monte Carlo policy valuation we don’t let this affect our
expected journey time (our value) until we get to our destination (or not).
But we've experienced this warning light a couple of times before. In
temporal difference (TD) learning our expected journey time is possibly im
mediately changed.
Notice that in both MC and TD I am not saying that we change our
policy (go to the garage for example), because at the moment we are still in
policy-evaluation mode.
TD learning uses what is called online updating. This means that updat
ing is done during the episode, we don’t wait until the end to update values.
In contrast offline updating is when we wait until the end of an episode before
updating values.
We don’t need an episode to terminate before learning from it. In fact
TD learning doesn’t even need the sequences to be episodic. Unlike MC we
can learn from neverending sequences.
Instead of the full return G_t, TD uses
r_{t+1} + \gamma\, v(s_{t+1}).
This is called the TD target. And here v(s_{t+1}) is the sample estimate that we have
found so far, not the true expected return. (Which means that this can be
a source of bias.) So the updating rule is
v(s_t) \to v(s_t) + \beta \left( r_{t+1} + \gamma\, v(s_{t+1}) - v(s_t) \right).
One issue we will have to address frequently is how to avoid getting stuck
in a policy that is not optimal. If we always choose what seems to be the
best policy then we might be premature. We might not have explored other
actions sufficiently, or perhaps have an inaccurate estimate of values. If we
always choose what we think is optimal it is called greedy.
It is much better to spend some time exploring the environment before
homing in on what you think might be optimal. A fully greedy policy might
be counterproductive. We might be lucky to win when we first stand on
a 14 when the dealer has an eight. We would then never experience hitting
14 when the dealer has an eight.
A simple way of avoiding getting stuck is to use ε-greedy exploration.
This means that with probability 1 − ε, for some chosen ε, you take the
greedy action but with probability ε you choose any one of the possible
actions, all being equally likely. In that way you will explore the suboptimal
solutions, you never know they might eventually turn out to be optimal.
Of course, as our policy improves we will need to decrease ε otherwise
we will forever have some randomness in our policy.
11.20 Sarsa
We've seen enough updating rules now that I can cut straight to the
chase. The method is called sarsa, which stands for “state action reward
state action."
This updating is done at every step of every episode. The actions that are
chosen are from an e-greedy policy.
Let’s go through the algorithm.
Step 0: initialize
First choose starting values for Q(s,a). Although the rate of convergence
will not be affected by this choice the time to convergence
to within some error might. Perhaps make everything zero.
Hence
Q(s_t, a_t) \to Q(s_t, a_t) + \beta \left( r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right).    (11.7)
Here I have emphasised the updating to the state and the action as
well. See Figure 11.13.
[Figure 11.13: one sarsa update. In state s take action a, collect reward r, arrive in state s' and take action a'; Q(s,a) is updated using Q(s',a').]
Repeat Steps 2 and 3 until termination of the episode and then return
to Step 1 for a new episode.
You can see where the name comes from: s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}.
11.21 Q Learning
Step 0: Initialize
First choose starting values for Q(s,a). Although the rate of convergence
will not be affected by this choice the time to convergence
to within some error might. Perhaps make everything zero.
I've used a'' to mean all the actions you could possibly take. The
action taken according to the behaviour policy would be a', which need
not be the action that maximizes Q.
[Figure: one Q-learning update. In state s take action a according to the behaviour policy, collect reward r and arrive in state s'; the update to Q(s,a) uses the action a'' that maximizes Q(s', a'').]
Repeat Steps 2 and 3 until termination of the episode and then return
to Step 1 for a new episode.
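To see the difference in code, here is a minimal Q-learning sketch for the earlier maze. The representation of the maze (a dictionary of which cells you can move to) and the values of beta, gamma and epsilon are my own choices for illustration; the update line is the one that matters, bootstrapping off the best action in the next state rather than the one actually taken.

import numpy as np

# For each cell, the cells you can move to. C3 is terminal.
neighbours = {"A1": ["B1"], "A3": ["B3"], "B1": ["A1", "B2"],
              "B2": ["B1", "B3"], "B3": ["A3", "B2", "C3"], "C3": []}

Q = {(s, a): 0.0 for s, moves in neighbours.items() for a in moves}
beta, gamma, epsilon = 0.1, 1.0, 0.1
rng = np.random.default_rng(0)

for episode in range(5_000):
    s = "A1"
    while s != "C3":
        moves = neighbours[s]
        if rng.random() < epsilon:
            a = moves[rng.integers(len(moves))]        # explore
        else:
            a = max(moves, key=lambda m: Q[(s, m)])    # greedy behaviour
        s_next, r = a, -1.0                            # each step costs 1
        # Q learning: use the best action in the next state, not the one taken.
        best_next = max((Q[(s_next, m)] for m in neighbours[s_next]), default=0.0)
        Q[(s, a)] += beta * (r + gamma * best_next - Q[(s, a)])
        s = s_next

print({k: round(v, 1) for k, v in Q.items()})   # e.g. Q(B3, C3) ends up near -1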
3. Each player places their bet in front of them before any cards are dealt
4. The dealer deals one card face up to each player and one to themselves.
Another face-up card is then dealt to all players and another card, this
time face down (the hole card), to the dealer
5. The count of a hand is the sum of the values of al l the individual cards;
Aces are one or 11, 2-9 count as the number of pips, 10, Jack, Queen
and King count as ten
6. The goal of each player is to beat the dealer, i.e. get a higher count
than the dealer
7. Each player in turn gets to take (Hit) another card, an offer that is
repeated, or Stand
8. If a player goes over 21 they have bust and lose their bet immediately
9. If a player has an Ace which they are treating as 11 for the count and
then a hit takes them over 21 then they simply treat that Ace as a
value one going forward
10. A hand without any ace, or a hand in which aces have a value of one,
is called a hard hand. A hand in which there is an ace that is counted
as 11 is called a soft hand
12. A dealer's natural beats a player who has 21 with three or more cards
13. A player can double their bet before taking any cards, on condition
that they then only take one extra card
14. If a player is initially dealt two cards of the same value they can be
split to make two new hands. The player puts up another bet of the
same size and the dealer deals another card to each of the two hands
15. If the dealer's upcard is an Ace then he can offer insurance. This is a
side bet paying 2:1 if his hole card is a ten count
16. Once all players in turn have stood or gone bust it is the dealer's turn
17. The dealer in a casino does not have the flexibility in whether to hit
or stand that a player has. Typically they have to hit a count of 16 or
less and stand on 17 or more
18. If the player beats the dealer they get paid off at evens
19. If player and dealer have the same final count then it is a tie or push
Those are the basic rules. In some casinos these rules are fine tuned:
20. In some casinos a dealer with a count of 17 including an Ace will hit.
So the dealer will hit a soft 17
21. In some casinos the dealer's hole card isn't dealt until after all players
have played
You can see from the rules that the player has a lot of decisions to make:
Hit; Stand; Split; Take out insurance; Count Ace as one or 11; And how
much to bet. This is all to the player's advantage, they can optimize their
actions. The 3:2 payoff for a natural is also to the player's advantage. On
the other hand the dealer is constrained to always play the same way: Hit
16 or less; Stand on 17 or more. Yet despite all this asymmetry in favour of
the player the bank still has a significant edge. And that's because there is
one asymmetry in favour of the house: If the player goes bust they lose their
bet, even if the dealer goes bust later.
Remember that we are assuming the cards are dealt from an infinite number
of decks. That means that the probability of dealing any particular card
never changes. The state is thus represented by our current count, whether
we have at least one Ace and the dealer's upcard.
How many states are there? The dealer could have A-10. We could have
12-21 (we would always hit 11 or less). Although we would never hit 21 we
need to know whether we have 21 because we might reach that state from
another state. And we may or may not have an Ace that's valued at one.
That makes a total of 10 x 10 x 2 = 200 states. That's fewer than Noughts
and Crosses.
5. Two grids of 10 x 10, one for the state-value function for hard hands
and one for soft hands
Figure 11.15: The state-value, v{s), function for hard hands for the policy
in which the player fol lows the same rules as the dealer: MC.
In Figure 11.15 is shown the state-value function for hard hands after a
few thousand episodes.
Please note that I am not claiming that these are the final results, or that
they are correct. Before going to a casino read the book mentioned at the
end of this chapter. By the way, if you follow the dealer's strategy you are
going to lose quite quickly. (Look at all the negative numbers in 11.15.)
Without knowing the transition probabilities, that is without knowing the
environment (as we are assuming), then we cannot use the information in
Figure 11.15 to tell us how to improve our policy from the one we have
adopted. For that we need the action-value function. That will come soon.
TD learning
v(13, \text{Soft}, 7) \to v(13, \text{Soft}, 7) + \beta \left( 0 + v(17, \text{Soft}, 7) - v(13, \text{Soft}, 7) \right).
(The v(17, Soft, 7) at this point is whatever it was for the previous episode.)
Then there would be a similar update for the next state we find ourselves in,
and finally an update using the reward at the end of the hand.
The state-value function for hard hands wil l be similar to that shown in
Figure 11.15.
Now let’s move on to finding the optimal policy for Blackjack.
Sarsa
Finding the optimal policy using sarsa first involves going over to the
action-value function: Q(Player, Hard/Soft, Dealer, Hit/Stand).
The updating rule is now
2. Then the value of Q((13, Soft, 7), Hit) versus Q((13, Soft, 7), Stand)
tells us whether to hit or stand. Although with probability ε we toss a
coin to make that decision. We decide to hit
5. Etc.
Figure 11.16: One part of the updating for a hand of Blackjack: Sarsa.
And there would be similar updating for each of the other realised state-action
pairs, the final one using the actual reward at the end of the hand.
                       Dealer
          2  3  4  5  6  7  8  9  10  A
      12  H  H  H  H  H  H  H  H  H   H
  H   13  H  S  S  S  S  H  H  H  H   H
  A   14  S  S  S  S  S  H  H  H  H   H
  R   15  S  S  S  S  S  H  H  H  H   H
  D   16  S  S  S  S  S  H  H  H  H   H
      17  S  S  S  S  S  S  S  S  S   S
      18  S  S  S  S  S  S  S  S  S   S
      19  S  S  S  S  S  S  S  S  S   S
      20  S  S  S  S  S  S  S  S  S   S
      21  S  S  S  S  S  S  S  S  S   S
Figure 11.17: Results from sarsa for hard hands. The recognised optimal
strategy for hitting given by the boxes.
Results for hard hands after a few million episodes are shown in Figure
11.17. My results are labelled as H or S for Hit or Stand, and the Hits are
grey. All I've done here is take the Q function and label the cell according to
which is greater the Q for Hit or the Q for Stand in that state. The results
are homing in on the classical, recognised optimal strategy for hitting given
by the bordered boxes in that figure. I would need to run it for longer to
perfect the convergence. You can find the classical results in any decent book
on Blackjack strategy.
Q learning
5. Etc.
And so:
Q((13, \text{Soft}, 7), \text{Hit}) \to Q((13, \text{Soft}, 7), \text{Hit}) + \beta \left( 0 + \gamma \max_{a''} Q((17, \text{Soft}, 7), a'') - Q((13, \text{Soft}, 7), \text{Hit}) \right).
And there would be similar updating for each of the other realised state-
action pairs.
This particular updating is shown in Figure 11.18. This differs from the
equivalent Sarsa diagram, 11.16, in two ways:
• First, we don’t care whether our action in state Soft 17 is Hit or Stand,
hence the question marks in the diagram.
• Second, the information that gets passed to Q((13, Soft, 7), Hit) is the
maximum over the possible action-value functions in state Soft 17.
Figure 11.18: One part of the updating for a hand of Blackjack: Q learning.
I cannot leave you without giving a few clues as to how to actually win
at Blackjack.
We've seen how you can find the optimal policy with sarsa and Q learning.
But still that's not enough.
To win at Blackjack you need, as well as the optimum strategy, to keep
track of the cards that have been dealt and to adjust your bets accordingly.
No one expects you to keep track of everything that has been dealt.
So various simple techniques have been devised to help you track the most
important information. One way of doing this is to keep a running count in
your head of the following cards, Aces, twos,..., sixes, tens as follows.
With a fresh shoe the running count is zero. Every time you see a two-six
dealt (to other players, yourself, or to the dealer) you add 1 to that running
count. And for every Ace or ten you see you subtract one. So if a round goes
like this: an Ace, a two, two threes, a six, a nine, an eight, one Jack, one
Queen and a seven, then the running count would go -1, 0, 1, 2, 3, 3, 3, 2, 1, 1,
ending up at plus one. I.e. the deck is now ever so slightly better than before.
And you keep this count going, only ever going back to zero for a new shoe.
If the running count gets large enough relative to the number of cards left
in the shoe then it's time to increase your bet size. (Of course then you get
thrown out of the casino because the dealer is probably also counting cards,
as it’s called, and now he knows you’re some kinda wise guy.)
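The count itself is only a couple of lines of code. Here is a small Python sketch of the running count just described, checked against the round of cards in the example; the card encoding is an assumption for illustration.

# Running count: +1 for each two to six seen, -1 for each ten-value card
# or Ace, zero for everything else.
def update_count(count, card):
    if card in ('2', '3', '4', '5', '6'):
        return count + 1
    if card in ('10', 'J', 'Q', 'K', 'A'):
        return count - 1
    return count

count = 0
for card in ['A', '2', '3', '3', '6', '9', '8', 'J', 'Q', '7']:
    count = update_count(count, card)
print(count)   # ends at +1, as in the round described above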
Optimal betting
I said above that you have to adjust your bet, increasing it when the
deck is favourable and decreasing the bet when it isn't. Then you might be
able to win. There is some mathematics to this, going under the names of the
Kelly criterion and the Kelly fraction. There's a simple formula; you can derive
it on the back of an envelope. However, since this isn't really a book about
gambling I'm going to stop here, and direct you to the book listed at the
end of this chapter.
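But to give you a flavour, here is the standard back-of-the-envelope version, as a Python sketch, for a simple repeated bet paying b to 1 with probability p of winning: the Kelly fraction is f = (bp − (1 − p))/b, which for an even-money bet is just your edge, 2p − 1. Treat this as a sketch only; real Blackjack payoffs (naturals paying 3 to 2, doubling, splitting) make the honest answer messier.

# Kelly fraction for a repeated bet paying b to 1 with win probability p.
def kelly_fraction(p, b=1.0):
    return (b * p - (1.0 - p)) / b

print(kelly_fraction(0.51))   # about 0.02: stake roughly 2% of your bankroll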
My examples have all had relatively small state spaces, and just a small
choice of actions. More often than not in real problems you will be faced with
very large state spaces and/or actions. For example, you might be working
with a complex game such as chess. Often the state space will have infinite
dimensions, for example if there is a continuum of states. Or there will be a
continuum of actions.
We can approach such a problem in reinforcement learning by approximating
our value functions using simple or complicated functional forms with
a set of parameters. These parameters are the quantities that are updated
during the learning process.
We must convert the potentially infinite-dimensional states to finite-dimensional
feature vectors. This process is called feature extraction.
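To give a flavour of what this looks like, here is a minimal Python sketch: a hand-built feature vector for a Blackjack state, a value function that is linear in a parameter vector θ, and a semi-gradient TD(0) step that updates θ. The particular features are made up purely for illustration.

# Feature extraction: turn a (total, Soft/Hard, dealer card) state into a
# short feature vector. These features are illustrative only.
def phi(state):
    total, softness, dealer = state
    return [1.0, total / 21.0, 1.0 if softness == 'Soft' else 0.0, dealer / 11.0]

# Approximate value function: a dot product of parameters with features.
def v(theta, state):
    return sum(t * f for t, f in zip(theta, phi(state)))

# Semi-gradient TD(0) step: it is theta, not a table entry, that gets updated.
def td0_semi_gradient(theta, state, reward, next_state, beta=0.01):
    target = reward + (v(theta, next_state) if next_state is not None else 0.0)
    delta = target - v(theta, state)
    return [t + beta * delta * f for t, f in zip(theta, phi(state))]

theta = [0.0, 0.0, 0.0, 0.0]
theta = td0_semi_gradient(theta, (13, 'Soft', 7), 0.0, (17, 'Soft', 7))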
Further Reading
There is just one book you will need for reinforcement learning, the classic
and recently updated Reinforcement Learning: An Introduction by Richard
Sutton and Andrew G. Barto.
But the one book you must get, although it's nothing to do with machine
learning, is Beat The Dealer by Ed Thorp, published by Random House. It's
a classic if you are into card games, or gambling, or finance, or if you want to
learn about other methods for approaching the mathematics of Blackjack. Or just
because it's a great read.
Datasets
I've used a variety of data sets in this book to demonstrate the various
methods. Some of these data sets are available to the public and some are
private. I only occasionally adjusted the data, and then only to identify and
explain some idea as compactly as possible. I am not claiming that whatever
method I employed on a particular problem was the best for that problem.
But equally they were probably not the worst.
Good places to start looking for data are
https://toolbox.google.com/datasetsearch
and https://www.kaggle.com. You'll need to register with the latter.
Here are some datasets you might want to download to play around with.
Some I used, some I didn't.
Titanic
Irises
The famous iris dataset that I used can be found everywhere, for example
https://www.kaggle.com/uciml/iris
MNIST
Mushrooms
https://www.kaggle.com/uciml/mushroom-classification
This contains data for over 8,000 mushrooms with information about their
geometry, colours, etc. and whether they are edible. Rather you than me.
Financial
Plenty of historical stock, index and exchange rate data can be found at
https://finance.yahoo.com
For specific stocks go to
https://finance.yahoo.com/quote/[STOCKSYMBOLHERE]/history/
The Bank of England has economic indicators, interest rates etc. at
https://www.bankofengland.co.uk/statistics/
Banknotes
Animals
Epilogue

I hope you enjoyed that as much as I did. My job is over, so I can sit
down with that single malt. But for you the work is only just beginning.
Please don't be a stranger; email me at paul@wilmott.com. Perhaps you'll
find typos, or maybe even major mistakes. Or maybe you'll want clarification
of some technical point or have ideas for new material. Probably you'll end
up teaching me, and that’s how it should be. You can also sign up at
wilmott.com, it's free! There we can all get together to shoot the breeze,
and plan the fightback for when the machines take over.
P.S. I’ll leave the final words to The Machine, from which (whom?) I
have learned so much.
Final Words
Oh boy! It was a very enjoyable shoot. Thanks for reading! What did
you guys think of the new series? Was this a fun experience or did it take
some time to get started?
What makes a good leader? A clear vision and focus. Leadership is what
really matters — the vision of the future. It matters how you work it.
To truly understand how we're going to get there, we need to understand
how you are going to be able to execute on your vision. Don't let a lack of
leadership distract you from your true purpose. Your first task is to define
where you are in this new world, and then to make those changes.
The new world is the new vision.
It is not the future, but it's the future.