The Hundred Page Machine Learning 2019
In this section, I briefly explain how supervised learning works so that you have the picture
of the whole process before we go into detail. I decided to use supervised learning as an
example because it’s the type of machine learning most frequently used in practice.
The supervised learning process starts with gathering the data. The data for supervised
learning is a collection of pairs (input, output). Inputs could be anything: email
messages, pictures, or sensor measurements, for example. Outputs are usually real numbers or
labels (e.g., “spam”, “not_spam”, “cat”, “dog”, “mouse”). In some cases, outputs are vectors (e.g.,
four coordinates of the rectangle around a person in the picture), sequences (e.g., [“adjective”,
“adjective”, “noun”] for the input “big beautiful car”), or have some other structure.
Let’s say the problem that you want to solve using supervised learning is spam detection.
You gather the data, for example, 10,000 email messages, each with a label, either “spam” or
“not_spam” (you could add those labels manually or pay someone to do that for you). Now,
you have to convert each email message into a feature vector.
The data analyst decides, based on their experience, how to convert a real-world entity, such
as an email message, into a feature vector. One common way to convert a text into a feature
vector, called bag of words, is to take a dictionary of English words (let’s say it contains
20,000 alphabetically sorted words) and stipulate that in our feature vector:
• the first feature is equal to 1 if the email message contains the word “a”; otherwise,
this feature is 0;
• the second feature is equal to 1 if the email message contains the word “aaron”; otherwise,
this feature equals 0;
• ...
• the feature at position 20,000 is equal to 1 if the email message contains the word
“zulu”; otherwise, this feature is equal to 0.
You repeat the above procedure for every email message in your collection, which gives
you 10,000 feature vectors (each vector having the dimensionality of 20,000), each paired
with a label (“spam”/“not_spam”).
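The bag-of-words conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the seven-word vocabulary stands in for the 20,000-word dictionary, the two email texts are made up, and the function name `bag_of_words` and the letters-only tokenization are assumptions of this sketch.

```python
import re

def bag_of_words(text, vocabulary):
    """Return a binary feature vector: 1 if the vocabulary word occurs in text, else 0."""
    # Assumption of this sketch: lowercase the text and keep runs of letters as words.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]

# Toy, alphabetically sorted vocabulary standing in for the 20,000-word dictionary.
vocabulary = sorted(["a", "free", "meeting", "money", "now", "tomorrow", "win"])

# Made-up labeled examples (input, output).
emails = [
    ("Win free money now", "spam"),
    ("A meeting tomorrow", "not_spam"),
]

feature_vectors = [bag_of_words(text, vocabulary) for text, _ in emails]
labels = [label for _, label in emails]

for vector, label in zip(feature_vectors, labels):
    print(vector, label)
```

Each feature vector has as many positions as the vocabulary has words, so with the real dictionary every vector would have 20,000 positions, most of them 0.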
Now you have machine-readable input data, but the output labels are still in the form of
human-readable text. Some learning algorithms require transforming labels into numbers.
For example, some algorithms require numbers like 0 (to represent the label “not_spam”)
and 1 (to represent the label “spam”). The algorithm I use to illustrate supervised learning is
called Support Vector Machine (SVM). This algorithm requires that the positive label (in
our case it’s “spam”) has the numeric value of +1 (one), and the negative label (“not_spam”)
has the value of −1 (minus one).
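As a small sketch of this label conversion, the mapping below turns the text labels into the +1/−1 values that SVM expects. The names `LABEL_TO_NUMBER` and `encode_labels` and the sample label list are illustrative choices, not part of any library.

```python
# Positive class ("spam") maps to +1, negative class ("not_spam") to -1.
LABEL_TO_NUMBER = {"spam": +1, "not_spam": -1}

def encode_labels(labels):
    """Convert human-readable labels into the numeric values used by SVM."""
    return [LABEL_TO_NUMBER[label] for label in labels]

# Made-up labels for four email messages.
text_labels = ["spam", "not_spam", "not_spam", "spam"]
numeric_labels = encode_labels(text_labels)
print(numeric_labels)
```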
At this point, you have a dataset and a learning algorithm, so you are ready to apply
the learning algorithm to the dataset to get the model.
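SVM sees every feature vector as a point in a high-dimensional space (in our case, the space
is 20,000-dimensional). The algorithm looks for a hyperplane that separates the examples with
positive labels from the examples with negative labels. This separating hyperplane is called
the decision boundary. It is described by two parameters, a real-valued vector $\mathbf{w}$ of
the same dimensionality as the input feature vector $\mathbf{x}$, and a real number $b$:
$$\mathbf{w}\mathbf{x} - b = 0,$$
where the expression $\mathbf{w}\mathbf{x}$ means $w^{(1)}x^{(1)} + w^{(2)}x^{(2)} + \dots + w^{(D)}x^{(D)}$,
and $D$ is the number of dimensions of the feature vector $\mathbf{x}$.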
(If some equations aren’t clear to you right now, in Chapter 2 we revisit the math and
statistical concepts necessary to understand them. For the moment, try to get an intuition of
what’s happening here. It all becomes more clear after you read the next chapter.)
Now, the predicted label for some input feature vector $\mathbf{x}$ is given by:
$$y = \operatorname{sign}(\mathbf{w}\mathbf{x} - b),$$
where sign is a mathematical operator that takes any value as input and returns $+1$ if the
input is a positive number or $-1$ if the input is a negative number.
The goal of the learning algorithm (SVM in this case) is to use the dataset to find
the optimal values $\mathbf{w}^*$ and $b^*$ for parameters $\mathbf{w}$ and $b$. Once the learning
algorithm identifies these optimal values, the model $f(\mathbf{x})$ is then defined as:
$$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^*\mathbf{x} - b^*)$$
Therefore, to predict whether an email message is spam or not spam using an SVM model,
you have to take the text of the message, convert it into a feature vector, then multiply this
vector by $\mathbf{w}^*$, subtract $b^*$, and take the sign of the result. This gives you the
prediction ($+1$ means “spam”, $-1$ means “not_spam”).
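The prediction step can be sketched as follows. The parameter values `w_star` and `b_star` and the two 7-dimensional feature vectors are invented for illustration; a real model would obtain the parameters from training. The text defines sign only for positive and negative inputs, so returning +1 for an input of exactly zero is an assumption of this sketch.

```python
def dot(w, x):
    """Compute w(1)x(1) + w(2)x(2) + ... + w(D)x(D)."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

def sign(value):
    # Assumption: an input of exactly 0 is mapped to +1.
    return 1 if value >= 0 else -1

def predict(x, w_star, b_star):
    """Return +1 ("spam") or -1 ("not_spam") for feature vector x."""
    return sign(dot(w_star, x) - b_star)

# Hypothetical trained parameters for a 7-word vocabulary.
w_star = [0.0, 1.5, -1.0, 2.0, 0.5, -0.5, 1.0]
b_star = 1.0

# Hypothetical feature vectors of two new messages.
spam_like = [0, 1, 0, 1, 1, 0, 1]
ham_like = [1, 0, 1, 0, 0, 1, 0]

print(predict(spam_like, w_star, b_star))
print(predict(ham_like, w_star, b_star))
```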
Now, how does the machine find $\mathbf{w}^*$ and $b^*$? It solves an optimization problem. Machines
are good at optimizing functions under constraints.
So what are the constraints we want to satisfy here? First of all, we want the model to predict
the labels of our 10,000 examples correctly. Remember that each example $i = 1, \dots, 10000$ is
given by a pair $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is the feature vector of example $i$ and
$y_i$ is its label that takes values either $-1$ or $+1$. So the constraints are naturally:
• $\mathbf{w}\mathbf{x}_i - b \geq 1$ if $y_i = +1$, and
• $\mathbf{w}\mathbf{x}_i - b \leq -1$ if $y_i = -1$
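[Figure 1: an SVM model for two-dimensional feature vectors, showing the decision boundary $\mathbf{w}\mathbf{x} - b = 0$ and the parallel hyperplanes $\mathbf{w}\mathbf{x} - b = 1$ and $\mathbf{w}\mathbf{x} - b = -1$.]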
We would also prefer that the hyperplane separates positive examples from negative ones with
the largest margin. The margin is the distance between the closest examples of two classes,
as defined by the decision boundary. A large margin contributes to better generalization,
that is, how well the model will classify new examples in the future. To achieve that, we need
to minimize the Euclidean norm of $\mathbf{w}$, denoted by $\|\mathbf{w}\|$ and given by
$\sqrt{\sum_{j=1}^{D} (w^{(j)})^2}$.
So, the optimization problem that we want the machine to solve looks like this:
Minimize $\|\mathbf{w}\|$ subject to $y_i(\mathbf{w}\mathbf{x}_i - b) \geq 1$ for $i = 1, \dots, N$.
The expression $y_i(\mathbf{w}\mathbf{x}_i - b) \geq 1$ is just a compact way to write the above
two constraints.
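To make the objective and the constraints concrete, the sketch below checks a few candidate parameter pairs against a tiny two-dimensional dataset and keeps the feasible candidate with the smallest norm. It does not solve the optimization problem; a real SVM implementation uses a dedicated solver. The dataset, the candidates, and all function names are made up for illustration.

```python
import math

def dot(w, x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

def norm(w):
    """Euclidean norm of w."""
    return math.sqrt(sum(w_j ** 2 for w_j in w))

def satisfies_constraints(w, b, examples):
    """Check y_i * (w x_i - b) >= 1 for every example (x_i, y_i)."""
    return all(y * (dot(w, x) - b) >= 1 for x, y in examples)

# Made-up training set: two positive and two negative examples.
examples = [
    ([2.0, 2.0], +1),
    ([3.0, 3.0], +1),
    ([0.0, 0.0], -1),
    ([-1.0, 0.0], -1),
]

# Hypothetical candidate parameters (w, b).
candidates = [
    ([1.0, 1.0], 3.0),    # feasible, larger norm
    ([0.5, 0.5], 1.0),    # feasible, smaller norm
    ([0.25, 0.25], 0.5),  # violates the constraints
]

feasible = [(w, b) for w, b in candidates if satisfies_constraints(w, b, examples)]
best = min(feasible, key=lambda pair: norm(pair[0]))
print(best, norm(best[0]))
```

Among the candidates that classify every training example with the required margin, the one with the smallest norm is preferred, which is exactly what the minimization asks for.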
The solution of this optimization problem, given by $\mathbf{w}^*$ and $b^*$, is called the statistical
model, or, simply, the model. The process of building the model is called training.
For two-dimensional feature vectors, the problem and the solution can be visualized as shown
in fig. 1. The blue and orange circles represent, respectively, positive and negative examples,
and the line given by $\mathbf{w}\mathbf{x} - b = 0$ is the decision boundary.
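Why, by minimizing the norm of $\mathbf{w}$, do we find the highest margin between the two classes?
Geometrically, the equations $\mathbf{w}\mathbf{x} - b = 1$ and $\mathbf{w}\mathbf{x} - b = -1$ define
two parallel hyperplanes, as you see in fig. 1. The distance between these hyperplanes is given by
$\frac{2}{\|\mathbf{w}\|}$, so the smaller the norm $\|\mathbf{w}\|$, the larger the distance between
these two hyperplanes.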
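One way to see where the distance between the two hyperplanes comes from is the short derivation below. It is a sketch that introduces two symbols not used in the text: $\mathbf{x}_0$, an arbitrary point on the hyperplane $\mathbf{w}\mathbf{x} - b = -1$, and $d$, the distance travelled from $\mathbf{x}_0$ along the unit vector $\mathbf{w}/\|\mathbf{w}\|$ (which is perpendicular to both hyperplanes) until the hyperplane $\mathbf{w}\mathbf{x} - b = 1$ is reached.

```latex
\begin{align*}
\mathbf{w}\mathbf{x}_0 - b &= -1
  && \text{($\mathbf{x}_0$ lies on the first hyperplane)} \\
\mathbf{w}\left(\mathbf{x}_0 + d\,\frac{\mathbf{w}}{\|\mathbf{w}\|}\right) - b &= 1
  && \text{(the shifted point lies on the second hyperplane)} \\
(\mathbf{w}\mathbf{x}_0 - b) + d\,\frac{\mathbf{w}\mathbf{w}}{\|\mathbf{w}\|} &= 1
  && \text{(expand the product)} \\
-1 + d\,\|\mathbf{w}\| &= 1
  && \text{(use $\mathbf{w}\mathbf{w} = \|\mathbf{w}\|^2$)} \\
d &= \frac{2}{\|\mathbf{w}\|}
\end{align*}
```

Because $d$ is inversely proportional to $\|\mathbf{w}\|$, halving the norm doubles the distance between the hyperplanes.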
Why is a machine-learned model capable of correctly predicting the labels of new, previously
unseen examples? To understand that, look at the plot in fig. 1. If two classes are separable
from one another by a decision boundary, then examples that belong to different classes
are located on opposite sides of the decision boundary, in the two regions it creates.
If the examples used for training were selected randomly, independently of one another, and
following the same procedure, then, statistically, it is more likely that the new negative
example will be located on the plot somewhere not too far from other negative examples.
The same concerns the new positive example: it will likely come from the surroundings of
other positive examples. In such a case, our decision boundary will still, with high probability,
separate new positive and negative examples well from one another. For other, less likely
situations, our model will make errors, but because such situations are less likely, the number
of errors will likely be smaller than the number of correct predictions.
Intuitively, the larger the set of training examples, the less likely it is that new examples
will be dissimilar to (and lie on the plot far from) the examples used for training. To minimize
the probability of making errors on new examples, the SVM algorithm, by looking for the
largest margin, explicitly tries to draw the decision boundary in such a way that it lies as far
as possible from examples of both classes.