A Course in Machine Learning
http://hal3.name/courseml/
1 Decision Trees 8
3 The Perceptron 39
4 Practical Issues 53
6 Linear Models 86
Notation 189
Bibliography 190
Index 191
1 | Decision Trees
Learning Objectives:
• Evaluate whether a use of test data is "cheating" or not.

guesses about some unobserved property of some object, based on observed properties of that object.
The first question we’ll ask is: what does it mean to learn? In
order to develop learning machines, we must know what learning
actually means, and how to determine success (or failure). You’ll see
this question answered in a very limited learning setting, which will
be progressively loosened and adapted throughout the rest of this
book. For concreteness, our focus will be on a very simple model of learning called a decision tree.

Dependencies: None.
Alice has just begun taking a course on machine learning. She knows
that at the end of the course, she will be expected to have “learned”
all about this topic. A common way of gauging whether or not she
has learned is for her teacher, Bob, to give her an exam. She has done
well at learning if she does well on the exam.
But what makes a reasonable exam? If Bob spends the entire
semester talking about machine learning, and then gives Alice an
exam on History of Pottery, then Alice’s performance on this exam
will not be representative of her learning. On the other hand, if the
exam only asks questions that Bob has answered exactly during lec-
tures, then this is also a bad test of Alice’s learning, especially if it’s
an “open notes” exam. What is desired is that Alice observes specific
examples from the course, and then has to answer new, but related
questions on the exam. This tests whether Alice has the ability to
generalize. Generalization is perhaps the most central concept in
machine learning.
As a running concrete example in this book, we will use that of a
course recommendation system for undergraduate computer science
students. We have a collection of students and a collection of courses.
Each student has taken, and evaluated, a subset of the courses. The
evaluation is simply a score from −2 (terrible) to +2 (awesome). The
job of the recommender system is to predict how much a particular
student (say, Alice) will like a particular course (say, Algorithms).
Given historical data from course ratings (i.e., the past) we are
trying to predict unseen ratings (i.e., the future). Now, we could
be unfair to this system as well. We could ask it whether Alice is
likely to enjoy the History of Pottery course. This is unfair because
the system has no idea what History of Pottery even is, and has no
prior experience with this course. On the other hand, we could ask
it how much Alice will like Artificial Intelligence, which she took
last year and rated as +2 (awesome). We would expect the system to
predict that she would really like it, but this isn’t demonstrating that
the system has learned: it’s simply recalling its past experience. In
the former case, we’re expecting the system to generalize beyond its
experience, which is unfair. In the latter case, we’re not expecting it
to generalize at all.
This general set up of predicting the future based on the past is
at the core of most machine learning. The objects that our algorithm
will make predictions about are examples. In the recommender sys-
tem setting, an example would be some particular Student/Course
pair (such as Alice/Algorithms). The desired prediction would be the
rating that Alice would give to Algorithms.
To make this concrete, Figure 1.1 shows the general framework of
induction. We are given training data on which our algorithm is ex-
pected to learn. This training data is the examples that Alice observes
in her machine learning course, or the historical ratings data for the recommender system. Based on this training data, our learning algorithm induces a function f that will map a new example to a corresponding prediction. For example, our function might guess that f(Alice/Machine Learning) might be high because our training data said that Alice liked Artificial Intelligence. We want our algorithm to be able to make lots of predictions, so we refer to the collection of examples on which we will evaluate our algorithm as the test set. The test set is a closely guarded secret: it is the final exam on which our learning algorithm is being tested. If our algorithm gets to peek at it ahead of time, it's going to cheat and do better than it should.

Figure 1.1: The general supervised approach to machine learning: a learning algorithm reads in training data and computes a learned function f. This function can then automatically label future test examples.

? Why is it bad if the learning algorithm gets to peek at the test data?
learning problems, we will begin with the simplest case: binary clas-
sification.
Suppose that your goal is to predict whether some unknown user
will enjoy some unknown course. You must simply answer “yes” or
“no.” In order to make a guess, you’re allowed to ask binary ques-
tions about the user/course under consideration. For example:
You: Is the course under consideration in Systems?
Me: Yes
You: Has this student taken any other Systems courses?
Me: Yes
You: Has this student liked most previous Systems courses?
Me: No
You: I predict this student will not like this course.

Figure 1.2: A decision tree for a course recommender system, from which the in-text "dialog" is drawn.
The goal in learning is to figure out what questions to ask, in what
order to ask them, and what answer to predict once you have asked
enough questions.
The decision tree is so-called because we can write our set of ques-
tions and guesses in a tree format, such as that in Figure 1.2. In this
figure, the questions are written in the internal tree nodes (rectangles)
and the guesses are written in the leaves (ovals). Each non-terminal
node has two children: the left child specifies what to do if the an-
swer to the question is “no” and the right child specifies what to do if
it is “yes.”
In order to learn, I will give you training data. This data consists
of a set of user/course examples, paired with the correct answer for
these examples (did the given user enjoy the given course?). From
this, you must construct your questions. For concreteness, there is a
small data set in Table ?? in the Appendix of this book. This training
data consists of 20 course rating examples, with course ratings and
answers to questions that you might ask about this pair. We will
interpret ratings of 0, +1 and +2 as “liked” and ratings of −2 and −1
as “hated.”
In what follows, we will refer to the questions that you can ask as
features and the responses to these questions as feature values. The
rating is called the label. An example is just a set of feature values.
And our training data is a set of examples, paired with labels.
There are a lot of logically possible trees that you could build,
even over just this small number of features (the number is in the
millions). It is computationally infeasible to consider all of these to
try to choose the “best” one. Instead, we will build our decision tree
greedily. We will begin by asking:
If I could only ask one question, what question would I ask?
You want to find a feature that is most useful in helping you guess
whether this student will enjoy this course. A useful way to think
about this is to look at the histogram of labels for each feature. This
is shown for the first four features in Figure 1.3. Each histogram
shows the frequency of “like”/“hate” labels for each possible value
of an associated feature. From this figure, you can see that asking
the first feature is not useful: if the value is “no” then it’s hard to
guess the label; similarly if the answer is “yes.” On the other hand,
asking the second feature is useful: if the value is “no,” you can be
pretty confident that this student will hate this course; if the answer
is “yes,” you can be pretty confident that this student will like this
course.
More formally, you will consider each feature in turn. You might
consider the feature "Is this a Systems course?" This feature has two possible values: no and yes. Some of the training examples have an
answer of “no” – let’s call that the “NO” set. Some of the training
examples have an answer of “yes” – let’s call that the “YES” set. For
each set (NO and YES) we will build a histogram over the labels.
This is the second histogram in Figure 1.3. Now, suppose you were
to ask this question on a random example and observe a value of
“no.” Further suppose that you must immediately guess the label for
this example. You will guess “like,” because that’s the more preva-
lent label in the NO set (actually, it’s the only label in the NO set).
Alternatively, if you receive an answer of "yes," you will guess "hate"
because that is more prevalent in the YES set.
So, for this single feature, you know what you would guess if you
had to. Now you can ask yourself: if I made that guess on the train-
ing data, how well would I have done? In particular, how many ex-
amples would I classify correctly? In the NO set (where you guessed
“like”) you would classify all 10 of them correctly. In the YES set
(where you guessed “hate”) you would classify 8 (out of 10) of them
correctly. So overall you would classify 18 (out of 20) correctly. Thus,
we’ll say that the score of the “Is this a System’s course?” question is
18/20. How many training examples
You will then repeat this computation for each of the available would you classify correctly for
? each of the other three features
features to us, compute the scores for each of them. When you must from Figure 1.3?
choose which feature consider first, you will want to choose the one
with the highest score.
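To make the scoring concrete, the following is a minimal Python sketch of the counting described above. The data layout (a list of (features, label) pairs, with features stored as a dict of "yes"/"no" answers) and the name score_feature are illustrative assumptions, not the book's notation.

from collections import Counter

def score_feature(data, feature):
    # data: list of (features, label) pairs; features maps names to "yes"/"no",
    # label is "like" or "hate". Returns how many examples a one-question tree
    # on `feature` would classify correctly by guessing the majority label
    # within each branch.
    correct = 0
    for value in ("no", "yes"):
        histogram = Counter(label for feats, label in data
                            if feats[feature] == value)
        if histogram:
            correct += histogram.most_common(1)[0][1]
    return correct

For the "Is this a Systems course?" feature on the data described above, this count would come out to 18 (out of 20).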
But this only lets you choose the first feature to ask about. This
is the feature that goes at the root of the decision tree. How do we
choose subsequent features? This is where the notion of divide and
conquer comes in. You’ve already decided on your first feature: “Is
this a Systems course?” You can now partition the data into two parts:
the NO part and the YES part. The NO part is the subset of the data on which the value for this feature is "no"; the YES part is the rest. This
is the divide step.
The conquer step is to recurse, and run the same routine (choosing
the feature with the highest score) on the NO set (to get the left half
of the tree) and then separately on the YES set (to get the right half of
the tree).
At some point it will become useless to query on additional fea-
tures. For instance, once you know that this is a Systems course,
you know that everyone will hate it. So you can immediately predict
“hate” without asking any additional questions. Similarly, at some
point you might have already queried every available feature and still
not whittled down to a single answer. In both cases, you will need to
create a leaf node and guess the most prevalent answer in the current
piece of the training data that you are looking at.
Putting this all together, we arrive at the algorithm shown in Algorithm 1.3.² This function, DecisionTreeTrain, takes two arguments: our data, and the set of as-yet unused features. It has two base cases: either the data is unambiguous, or there are no remaining features. In either case, it returns a Leaf node containing the most likely guess at this point. Otherwise, it loops over all remaining features to find the one with the highest score. It then partitions the data into a NO/YES split based on the best feature. It constructs its left and right subtrees by recursing on itself. In each recursive call, it uses one of the partitions of the data, and removes the just-selected feature from consideration.

² There are more nuanced algorithms for building decision trees, some of which are discussed in later chapters of this book. They primarily differ in how they compute the score function.

? Is Algorithm 1.3 guaranteed to terminate?

The corresponding prediction algorithm is shown in Algorithm 1.3.
This function recurses down the decision tree, following the edges
specified by the feature values in some test point. When it reaches a
leaf, it returns the guess associated with that leaf.
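The following is a minimal Python sketch of both routines, reusing score_feature from the earlier sketch; the dict-based tree representation and helper names are illustrative assumptions rather than the book's pseudocode.

from collections import Counter

def majority_label(data):
    # most prevalent label among (features, label) pairs
    return Counter(label for _, label in data).most_common(1)[0][0]

def decision_tree_train(data, remaining_features):
    # data: list of (features, label) pairs; remaining_features: a set of names
    guess = majority_label(data)
    if len({label for _, label in data}) == 1 or not remaining_features:
        return guess                        # leaf: data unambiguous or no features left
    best = max(remaining_features, key=lambda f: score_feature(data, f))
    no_part  = [(x, y) for x, y in data if x[best] == "no"]
    yes_part = [(x, y) for x, y in data if x[best] == "yes"]
    rest = remaining_features - {best}
    return {"feature": best,
            "no":  decision_tree_train(no_part,  rest) if no_part  else guess,
            "yes": decision_tree_train(yes_part, rest) if yes_part else guess}

def decision_tree_test(tree, test_point):
    # follow the edges given by the feature values until a leaf is reached
    while isinstance(tree, dict):
        tree = tree["yes"] if test_point[tree["feature"]] == "yes" else tree["no"]
    return tree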
As you’ve seen, there are several issues that we must take into ac-
count when formalizing the notion of learning.
ê ≜ (1/N) ∑_{n=1}^{N} ℓ(y_n, f(x_n))    (1.8)
That is, our training error is simply our average error over the training data.

? Verify by calculation that we can write our training error as E_{(x,y)∼D}[ℓ(y, f(x))], by thinking of D as a distribution that places probability 1/N on each example in D and probability 0 on everything else.

Of course, we can drive ê to zero by simply memorizing our training data. But as Alice might find in memorizing past exams, this might not generalize well to a new exam!

This is the fundamental difficulty in machine learning: the thing
we have access to is our training error, ê. But the thing we care about
minimizing is our expected error e. In order to get the expected error
down, our learned function needs to generalize beyond the training
data to some future data that it might not have seen yet!
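As a concrete instance of Eq (1.8) with zero/one loss, the training error is just the fraction of training examples the learned function gets wrong; a minimal sketch (the names are illustrative):

def training_error(f, data):
    # average zero/one loss of predictor f over (x, y) training pairs
    return sum(1 for x, y in data if f(x) != y) / len(data)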
So, putting it all together, we get a formal definition of inductive machine learning: Given (i) a loss function ℓ and (ii) a sample D from some unknown distribution D, you must compute a function f that has low expected error e over D with respect to ℓ.
More formally, if D is a discrete probability distribution, then this expectation can be expanded as:

E_{(x,y)∼D}[ℓ(y, f(x))] = ∑_{(x,y)} D(x, y) ℓ(y, f(x))

This is exactly the weighted average loss over all (x, y) pairs in D, weighted by their probability (namely, D(x, y)) under this distribution D.

In particular, if D is a finite discrete distribution, for instance one defined by a finite data set {(x_1, y_1), . . . , (x_N, y_N)} that puts equal weight on each example (in this case, equal weight means probability 1/N), then we get:

E_{(x,y)∼D}[ℓ(y, f(x))] = (1/N) ∑_{n=1}^{N} ℓ(y_n, f(x_n))

In the case that the distribution is continuous, we need to replace the discrete sum with a continuous integral over some space Ω:

E_{(x,y)∼D}[ℓ(y, f(x))] = ∫_Ω D(x, y) ℓ(y, f(x)) dx dy    (1.7)

This is exactly the same but in continuous space rather than discrete space.
The most important thing to remember is that there are two equivalent ways to think about expectations:
1. The expectation of some function g is the weighted average value of g, where the weights are given by
the underlying probability distribution.
2. The expectation of some function g is your best guess of the value of g if you were to draw a single
item from the underlying probability distribution.
Suppose that, after graduating, you get a job working for a company
that provides personalized recommendations for pottery. You go in
and implement new algorithms based on what you learned in your
machine learning class (you have learned the power of generaliza-
tion!). All you need to do now is convince your boss that you have
done a good job and deserve a raise!
How can you convince your boss that your fancy learning algo-
rithms are really working?
Based on what we’ve talked about already with underfitting and
overfitting, it is not enough to just tell your boss what your training
error is. Noise notwithstanding, it is easy to get a training error of
zero using a simple database query (or grep, if you prefer). Your boss
will not fall for that.
The easiest approach is to set aside some of your available data as
“test data” and use this to evaluate the performance of your learning
algorithm. For instance, the pottery recommendation service that you
work for might have collected 1000 examples of pottery ratings. You
will select 800 of these as training data and set aside the final 200
as test data. You will run your learning algorithms only on the 800
training points. Only once you’re done will you apply your learned
model to the 200 test points, and report your test error on those 200
points to your boss.
The hope in this process is that however well you do on the 200
test points will be indicative of how well you are likely to do in the
future. This is analogous to estimating support for a presidential
candidate by asking a small (random!) sample of people for their
opinions. Statistics (specifically, concentration bounds of which the
“Central limit theorem” is a famous example) tells us that if the sam-
ple is large enough, it will be a good representative. The 80/20 split
is not magic: it’s simply fairly well established. Occasionally people
use a 90/10 split instead, especially if they have a lot of data.

? If you have more data at your disposal, why might a 90/10 split be preferable to an 80/20 split?

The cardinal rule of machine learning is: never touch your test data. Ever. If that's not clear enough:
key identifiers for hyperparameters (and the main reason that they
cause consternation) is that they cannot be naively adjusted using the
training data.
In DecisionTreeTrain, as in most machine learning, the learn-
ing algorithm is essentially trying to adjust the parameters of the
model so as to minimize training error. This suggests an idea for
choosing hyperparameters: choose them so that they minimize train-
ing error.
What is wrong with this suggestion? Suppose that you were to
treat “maximum depth” as a hyperparameter and tried to tune it on
your training data. To do this, maybe you simply build a collection
of decision trees, tree0 , tree1 , tree2 , . . . , tree100 , where treed is a tree
of maximum depth d. We then computed the training error of each
of these trees and chose the “ideal” maximum depth as that which
minimizes training error? Which one would it pick?
The answer is that it would pick d = 100. Or, in general, it would
pick d as large as possible. Why? Because choosing a bigger d will
never hurt on the training data. By making d larger, you are simply
encouraging overfitting. But by evaluating on the training data, over-
fitting actually looks like a good idea!
An alternative idea would be to tune the maximum depth on test
data. This is promising because test data performance is what we
really want to optimize, so tuning this knob on the test data seems
like a good idea. That is, it won’t accidentally reward overfitting. Of
course, it breaks our cardinal rule about test data: that you should
never touch your test data. So that idea is immediately off the table.
However, our “test data” wasn’t magic. We simply took our 1000
examples, called 800 of them “training” data and called the other 200
24 a course in machine learning
“test” data. So instead, let’s do the following. Let’s take our original
1000 data points, and select 700 of them as training data. From the
remainder, take 100 as development data³ and the remaining 200 as test data. The job of the development data is to allow us to tune hyperparameters.

³ Some people call this "validation data" or "held-out data."

1. Split your data into 70% training data, 10% development data and 20% test data.

2. For each possible choice of hyperparameters, train a model on the training data.

3. From the above collection of models, choose the one that achieved the lowest error rate on development data.

4. Evaluate that model on the test data to estimate future test performance.
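A sketch of this recipe in Python; train_fn and error_fn stand in for whatever learner (e.g., a depth-limited decision tree) and error measure you are using, and are assumptions for illustration.

import random

def split_and_tune(data, hyperparams, train_fn, error_fn):
    # steps 1-4 above: 70/10/20 split, train one model per hyperparameter
    # setting, pick by development error, report test error
    data = list(data)
    random.shuffle(data)
    n = len(data)
    train = data[:int(0.7 * n)]
    dev   = data[int(0.7 * n):int(0.8 * n)]
    test  = data[int(0.8 * n):]
    models = [(train_fn(train, h), h) for h in hyperparams]
    best_model, best_h = min(models, key=lambda mh: error_fn(mh[0], dev))
    return best_model, best_h, error_fn(best_model, test)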
? In step 3, you could either choose the model (trained on the 70% training data) that did the best on the development data. Or you could choose the hyperparameter settings that did best and retrain the model on the 80% union of training and development data. Is either of these options obviously better or worse?

1.10 Chapter Summary and Outlook

At this point, you should be able to use decision trees to do machine learning. Someone will give you data. You'll split it into training, development and test portions. Using the training and development
data, you’ll find a good value for maximum depth that trades off
between underfitting and overfitting. You’ll then run the resulting
decision tree model on the test data to get an estimate of how well
you are likely to do in the future.
You might think: why should I read the rest of this book? Aside
from the fact that machine learning is just an awesome fun field to
learn about, there’s a lot left to cover. In the next two chapters, you’ll
learn about two models that have very different inductive biases than
decision trees. You’ll also get to see a very useful way of thinking
about learning: the geometric view of data. This will guide much of
what follows. After that, you’ll learn how to solve problems more
complicated than simple binary classification. (Machine learning
people like binary classification a lot because it’s one of the simplest
non-trivial problems that we can work on.) After that, things will
diverge: you’ll learn about ways to think about learning as a formal
optimization problem, ways to speed up learning, ways to learn
without labeled data (or with very little labeled data) and all sorts of
other fun topics.
1.11 Exercises
2 | Geometry and Nearest Neighbors

Our brains have evolved to get us out of the rain, find where the berries are, and keep us from getting killed. Our brains did not evolve to help us grasp really large numbers or to look at things in a hundred thousand dimensions. – Ronald Graham

Learning Objectives:
• Describe a data set as points in a high dimensional space.
• Explain the curse of dimensionality.
• Compute distances between points in high dimensional space.
• Implement a K-nearest neighbor model of learning.
• Draw decision boundaries.
• Implement the K-means algorithm for clustering.

You can think of prediction tasks as mapping inputs (course reviews) to outputs (course ratings). As you learned in the previ-
ous chapter, decomposing an input into a collection of features (e.g.,
words that occur in the review) forms a useful abstraction for learn-
ing. Therefore, inputs are nothing more than lists of feature values.
This suggests a geometric view of data, where we have one dimen-
sion for every feature. In this view, examples are points in a high-
dimensional space.
Once we think of a data set as a collection of points in high dimen-
sional space, we can start performing geometric operations on this
data. For instance, suppose you need to predict whether Alice will
like Algorithms. Perhaps we can try to find another student who is most "similar" to Alice, in terms of favorite courses. Say this student is Jeremy. If Jeremy liked Algorithms, then we might guess that Alice will as well. This is an example of a nearest neighbor model of learning. By inspecting this model, we'll see a completely different set of answers to the key learning questions we discovered in Chapter 1.

Dependencies: Chapter 1.

d(a, b) = [ ∑_{d=1}^{D} (a_d − b_d)² ]^{1/2}    (2.1)
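A minimal Python sketch of K-nearest-neighbor prediction built on Eq (2.1); the data layout and function names are illustrative assumptions.

import math
from collections import Counter

def distance(a, b):
    # Euclidean distance between two feature vectors, Eq (2.1)
    return math.sqrt(sum((ad - bd) ** 2 for ad, bd in zip(a, b)))

def knn_predict(train, x, K):
    # train: list of (vector, label) pairs; x: query vector
    neighbors = sorted(train, key=lambda pair: distance(pair[0], x))[:K]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]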
The standard way that we’ve been thinking about learning algo-
rithms up to now is in the query model. Based on training data, you
learn something. I then give you a query example and you have to
guess it’s label.
An alternative, less passive, way to think about a learned model
is to ask: what sort of test examples will it classify as positive, and
what sort will it classify as negative. In Figure 2.9, we have a set of
training data. The background of the image is colored blue in regions
that would be classified as positive (if a query were issued there).

Figure 2.8: decision boundary for 1nn.

cuts. The cuts must be axis-aligned because nodes can only query on a single feature at a time. In this case, since the decision tree was so shallow, the decision boundary was relatively simple.

? What sort of data might yield a very simple decision boundary with a decision tree and very complex decision boundary with 1-nearest neighbor? What about the other way around?

2.4 K-Means Clustering
Up through this point, you have learned all about supervised learn-
ing (in particular, binary classification). As another example of the
use of geometric intuitions and data, we are going to temporarily
consider an unsupervised learning problem. In unsupervised learn-
ing, our data consists only of examples xn and does not contain corre-
sponding labels. Your job is to make sense of this data, even though
no one has provided you with correct labels. The particular notion of
“making sense of” that we will talk about now is the clustering task.
Consider the data shown in Figure 2.12. Since this is unsupervised
learning and we do not have access to labels, the data points are
simply drawn as black dots. Your job is to split this data set into
three clusters. That is, you should label each data point as A, B or C
in whatever way you want.
For this data set, it’s pretty clear what you should do. You prob-
ably labeled the upper-left set of points A, the upper-right set of
points B and the bottom set of points C. Or perhaps you permuted
these labels. But chances are your clusters were the same as mine.

Figure 2.12: simple clustering data... clusters in UL, UR and BC.
The K-means clustering algorithm is a particularly simple and
effective approach to producing clusters on data like you see in Fig-
ure 2.12. The idea is to represent each cluster by its cluster center.
Given cluster centers, we can simply assign each point to its nearest
center. Similarly, if we know the assignment of points to clusters, we
can compute the centers. This introduces a chicken-and-egg problem.
If we knew the clusters, we could compute the centers. If we knew
the centers, we could compute the clusters. But we don’t know either.
The general computer science answer to chicken-and-egg problems
is iteration. We will start with a guess of the cluster centers. Based
on that guess, we will assign each data point to its closest center.
Given these new assignments, we can recompute the cluster centers.
We repeat this process until clusters stop moving. The first few it-
erations of the K-means algorithm are shown in Figure 2.13. In this
example, the clusters converge very quickly.
Algorithm 2.4 spells out the K-means clustering algorithm in de-
tail. The cluster centers are initialized randomly. In line 6, data point
xn is compared against each cluster center µk . It is assigned to cluster
k if k is the center with the smallest distance. (That is the “argmin”
step.) The variable zn stores the assignment (a value from 1 to K) of
example n. In lines 8-12, the cluster centers are re-computed. First, Xk

Figure 2.13: first few iterations of k-means running on previous data set.
Algorithm 4 K-Means(D, K)
1: for k = 1 to K do
2:   µk ← some random location    // randomly initialize mean of kth cluster
3: end for
4: repeat
5:   for n = 1 to N do
6:     zn ← argmin_k ||µk − xn||    // assign example n to closest center
7:   end for
8:   for k = 1 to K do
9:     Xk ← { xn : zn = k }    // points assigned to cluster k
10:    µk ← mean(Xk)    // re-estimate mean of cluster k
11:  end for
12: until µs stop changing
stores all examples that have been assigned to cluster k. The center of
cluster k, µk is then computed as the mean of the points assigned to
it. This process repeats until the means converge.
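A compact Python sketch of the same loop, with data points given as equal-length tuples; this is an illustrative sketch (fixed iteration cap instead of a convergence test), not the book's reference implementation.

import random

def kmeans(points, K, iters=100):
    # points: list of D-dimensional tuples. Returns (assignments, centers).
    centers = random.sample(points, K)                    # random initialization
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    assignments = [0] * len(points)
    for _ in range(iters):
        # assign each example to its closest center (the "argmin" step)
        assignments = [min(range(K), key=lambda k: dist2(centers[k], x))
                       for x in points]
        # re-estimate each center as the mean of the points assigned to it
        for k in range(K):
            members = [x for x, z in zip(points, assignments) if z == k]
            if members:
                centers[k] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return assignments, centers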
An obvious question about this algorithm is: does it converge?
A second question is: how long does it take to converge? The first
question is actually easy to answer. Yes, it does. And in practice, it
usually converges quite quickly (usually fewer than 20 iterations). In
Chapter 13, we will actually prove that it converges. The question of
how long it takes to converge is actually a really interesting question.
Even though the K-means algorithm dates back to the mid 1950s, the
best known convergence rates were terrible for a long time. Here, ter-
rible means exponential in the number of data points. This was a sad
situation because empirically we knew that it converged very quickly.
New algorithm analysis techniques called “smoothed analysis” were
invented in 2001 and have been used to show very fast convergence
for K-means (among other algorithms). These techniques are well
beyond the scope of this book (and this author!) but suffice it to say
that K-means is fast in practice and is provably fast in theory.
It is important to note that although K-means is guaranteed to
converge and guaranteed to converge quickly, it is not guaranteed to
converge to the “right answer.” The key problem with unsupervised
learning is that we have no way of knowing what the “right answer”
is. Convergence to a bad solution is usually due to poor initialization.
For example, poor initialization in the data set from before yields
convergence like that seen in Figure ??. As you can see, the algorithm
34 a course in machine learning
think looks like a "round" cluster in two or three dimensions, might not look so "round" in high dimensions.

The second strange fact we will consider has to do with the distances between points in high dimensions. We start by considering random points in one dimension. That is, we generate a fake data set consisting of 100 random points between zero and one. We can do the same in two dimensions and in three dimensions. See Figure 2.19.
We can actually compute this in closed form (see Exercise ?? for a bit of calculus refresher) and arrive at avgDist(D) = √(D/3). Because we know that the maximum distance between two points grows like √D, this says that the ratio between average distance and maximum distance converges to 1/3.

What is more interesting, however, is the variance of the distribution of distances. You can show that in D dimensions, the variance is constant 1/√18, independent of D. This means that when you look at (variance) divided-by (max distance), the variance behaves like 1/√(18D), which means that the effective variance continues to shrink as D grows.³

³ Sergey Brin. Near neighbor search in large metric spaces. In Conference on Very Large Databases (VLDB), 1995.

When I first saw and re-proved this result, I was skeptical, as I imagine you are. So I implemented it. In Figure 2.20 you can see
the results. This presents a histogram of distances between random points in D dimensions for D ∈ {1, 2, 3, 10, 20, 100}. As you can see, all of these distances begin to concentrate around 0.4√D, even for

Figure 2.20: dimensionality versus uniform point distances: histograms of the number of pairs of points at each value of distance / sqrt(dimensionality), for 2, 8, 32 and 128 dims.

You should now be terrified: the only bit of information that KNN gets is distances. And you've just seen that in moderately high dimensions, all distances become equal. So then isn't it the case that
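A quick simulation of this effect; the script below is an illustrative sketch (not the author's original experiment) that samples pairs of uniform random points and reports the mean and standard deviation of their distances divided by √D.

import math
import random

def distance_stats(D, num_pairs=10000):
    # distances between uniform random points in [0,1]^D, scaled by sqrt(D)
    dists = []
    for _ in range(num_pairs):
        a = [random.random() for _ in range(D)]
        b = [random.random() for _ in range(D)]
        dists.append(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
                     / math.sqrt(D))
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return mean, std

for D in (2, 8, 32, 128):
    print(D, distance_stats(D))   # the mean stays near 0.4; the std shrinks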
Thus, nearby points get a vote very close to 1 and far away points get
a vote very close to 0. The overall prediction is positive if the sum
of votes from positive neighbors outweighs the sum of votes from
negative neighbors.

? Could you combine the ε-ball idea with the weighted voting idea? Does it make sense, or does one idea seem to trump the other?

The second issue with KNN is scaling. To predict the label of a single test point, we need to find the K nearest neighbors of that
test point in the training data. With a standard implementation, this will take O(ND + K log K) time.⁴ For very large data sets, this is impractical.

⁴ The ND term comes from computing distances between the test point and all training points. The K log K term comes from finding the K smallest values in the list of distances, using a median-finding algorithm. Of course, ND almost always dominates K log K in practice.

A first attempt to speed up the computation is to represent each class by a representative. A natural choice for a representative would be the mean. We would collapse all positive examples down to their mean, and all negative examples down to their mean. We could then just run 1-nearest neighbor and check whether a test point is closer to the mean of the positive points or the mean of the negative points. Figure 2.24 shows an example in which this would probably work well, and an example in which this would probably work poorly. The problem is that collapsing each class to its mean is too aggressive.
A less aggressive approach is to make use of the K-means algo-
rithm for clustering. You can cluster the positive examples into L
clusters (we are using L to avoid variable overloading!) and then
cluster the negative examples into L separate clusters. This is shown
in Figure 2.25 with L = 2. Instead of storing the entire data set,
you would only store the means of the L positive clusters and the
means of the L negative clusters. At test time, you would run the
K-nearest neighbors algorithm against these means rather than
against the full training set. This leads to a much faster runtime of
just O( LD + K log K ), which is probably dominated by LD.
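A sketch of this speedup, reusing the kmeans and knn_predict sketches from earlier in this chapter (the helper name and data layout are illustrative):

def compress_by_class(train, L):
    # replace each class's examples with the L cluster means of that class
    compressed = []
    for label in set(y for _, y in train):
        points = [x for x, y in train if y == label]
        _, centers = kmeans(points, min(L, len(points)))
        compressed.extend((c, label) for c in centers)
    return compressed

# at test time, run KNN against the 2L means instead of the full training set:
# prediction = knn_predict(compress_by_class(train, L=2), x, K)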
3 | The Perceptron

ing weights for features. As we'll see, learning weights for features
amounts to learning a hyperplane classifier: that is, basically a di-
vision of space into two halves by a straight line, where one half is
“positive” and one half is “negative.” In this sense, the perceptron
can be seen as explicitly finding a good linear decision boundary.
Folk biology tells us that our brains are made up of a bunch of little
units, called neurons, that send electrical signals to one another. The
rate of firing tells us how “activated” a neuron is. A single neuron,
like that shown in Figure 3.1 might have three incoming neurons.
These incoming neurons are firing at different rates (i.e., have dif-
ferent activations). Based on how much these incoming neurons are
firing, and how “strong” the neural connections are, our main neu-
ron will “decide” how strongly it wants to fire. And so on through
the whole brain. Learning in the brain happens by neurons becom-
ing connected to other neurons, and the strengths of connections adapting over time.

Figure 3.1: a picture of a neuron
The real biological world is much more complicated than this.
However, our goal isn’t to build a brain, but to simply be inspired
by how they work. We are going to think of our learning algorithm
as a single neuron. It receives input from D-many other neurons,
one for each input feature. The strength of these inputs are the fea-
ture values. This is shown schematically in Figure ??. Each incom-
ing connection has a weight and the neuron simply sums up all the
weighted inputs. Based on this sum, it decides whether to “fire” or
So the difference between the old activation a and the new activation a′ is ∑_d x_d² + 1. But x_d² ≥ 0, since it's squared. So this value is always at least one. Thus, the new activation is always at least the old activation plus one. Since this was a positive example, we have successfully moved the activation in the proper direction. (Though note that there's no guarantee that we will correctly classify this point the second, third or even fourth time around!)

? This analysis holds for the case of positive examples (y = +1). It should also hold for negative examples. Work it out.

The only hyperparameter of the perceptron algorithm is MaxIter, the number of passes to make over the training data. If we make
many many passes over the training data, then the algorithm is likely
to overfit. (This would be like studying too long for an exam and just
confusing yourself.) On the other hand, going over the data only
one time might lead to underfitting. This is shown experimentally in
Figure 3.3. The x-axis shows the number of passes over the data and
the y-axis shows the training error and the test error. As you can see,
there is a “sweet spot” at which test performance begins to degrade
due to overfitting.
One aspect of the perceptron algorithm that is left underspecified
is line 4, which says: loop over all the training examples. The natural
implementation of this would be to loop over them in a constant
order. This is actually a bad idea.
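A minimal Python sketch of the perceptron training loop described above; shuffling the examples on each pass reflects the point about ordering, and the names are illustrative.

import random

def perceptron_train(data, max_iter):
    # data: list of (x, y) pairs with x a list of feature values and y in {-1, +1}
    D = len(data[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(max_iter):
        random.shuffle(data)                              # avoid a fixed, pathological order
        for x, y in data:
            a = sum(wd * xd for wd, xd in zip(w, x)) + b  # activation
            if y * a <= 0:                                # mistake: update weights and bias
                w = [wd + y * xd for wd, xd in zip(w, x)]
                b += y
    return w, b

def perceptron_predict(w, b, x):
    return +1 if sum(wd * xd for wd, xd in zip(w, x)) + b > 0 else -1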
The decision boundary for a perceptron is a very magical thing. In
D dimensional space, it is always a D − 1-dimensional hyperplane.
(In two dimensions, a 1-d hyperplane is simply a line. In three di-
mensions, a 2-d hyperplane is like a sheet of paper.) This hyperplane
divides space in half. In the rest of this book, we’ll refer to the weight
vector, and to the hyperplane it defines, interchangeably.
The perceptron update can also be considered geometrically. (For
simplicity, we will consider the unbiased case.) Consider the situ-
ation in Figure ??. Here, we have a current guess as to the hyper-
plane, and a positive training example comes in that is currently misclassified. The weights are updated: w ← w + yx. This yields the new weight vector, also shown in the Figure. In this case, the weight vector changed enough that this training example is now correctly classified.

Figure 3.8: perceptron picture with update, no bias
You already have an intuitive feeling for why the perceptron works:
it moves the decision boundary in the direction of the training exam-
ples. A question you should be asking yourself is: does the percep-
tron converge? If so, what does it converge to? And how long does it
take?
It is easy to construct data sets on which the perceptron algorithm
will never converge. In fact, consider the (very uninteresting) learn-
ing problem with no features. You have a data set consisting of one
positive example and one negative example. Since there are no fea-
tures, the only thing the perceptron algorithm will ever do is adjust
the bias. Given this data, you can run the perceptron for a bajillion
iterations and it will never settle down. As long as the bias is non-
negative, the negative example will cause it to decrease. As long as
it is non-positive, the positive example will cause it to increase. Ad
infinitum. (Yes, this is a very contrived example.)
What does it mean for the perceptron to converge? It means that
it can make an entire pass through the training data without making
any more updates. In other words, it has correctly classified every
training example. Geometrically, this means that it has found some
hyperplane that correctly segregates the data into positive and negative examples, like that shown in Figure 3.9.

Figure 3.9: separable data
In this case, this data is linearly separable. This means that there
exists some hyperplane that puts all the positive examples on one side
and all the negative examples on the other side. If the training data is not
linearly separable, like that shown in Figure 3.10, then the perceptron
has no hope of converging. It could never possibly classify each point
correctly.
The somewhat surprising thing about the perceptron algorithm is
that if the data is linearly separable, then it will converge to a weight
vector that separates the data. (And if the data is inseparable, then it
will never converge.) This is great news. It means that the perceptron
converges whenever it is even remotely possible to converge.
The second question is: how long does it take to converge? By
“how long,” what we really mean is “how many updates?” As is the
case for much learning theory, you will not be able to get an answer
of the form “it will converge after 5293 updates.” This is asking too
much. The sort of answer we can hope to get is of the form “it will
converge after at most 5293 updates.”
What you might expect to see is that the perceptron will con-
verge more quickly for easy learning problems than for hard learning
problems. This certainly fits intuition. The question is how to define
“easy” and “hard” in a meaningful way. One way to make this def-
inition is through the notion of margin. If I give you a data set and
hyperplane that separates it (like that shown in Figure ??) then the
margin is the distance between the hyperplane and the nearest point.
Intuitively, problems with large margins should be easy (there’s lots
of “wiggle room” to find a separating hyperplane); and problems
with small margins should be hard (you really have to get a very
specific well tuned weight vector).
Formally, given a data set D, a weight vector w and bias b, the
margin of w, b on D is defined as:
margin(D, w, b) = { min_{(x,y)∈D} y(w · x + b)    if w separates D
                  { −∞                             otherwise          (3.8)
In words, the margin is only defined if w, b actually separate the data
(otherwise it is just −∞). In the case that it separates the data, we
find the point with the minimum activation, after the activation is
multiplied by the label.

? So long as the margin is not −∞, it is always positive. Geometrically this makes sense, but why does Eq (3.8) yield this?

For some historical reason (that is unknown to the author), margins are always denoted by the Greek letter γ (gamma). One often talks about the margin of a data set. The margin of a data set is the largest attainable margin on this data. Formally:

margin(D) = sup_{w,b} margin(D, w, b)

In words, to compute the margin of a data set, you "try" every possible w, b pair. For each pair, you compute its margin. We then take the largest of these as the overall margin of the data.¹ If the data is not linearly separable, then the value of the sup, and therefore the value of the margin, is −∞.

¹ You can read "sup" as "max" if you like: the only difference is a technical difference in how the −∞ case is handled.
There is a famous theorem due to Rosenblatt² that shows that the number of errors that the perceptron algorithm makes is bounded by γ⁻². More formally:

² Rosenblatt 1958

Theorem 1 (Perceptron Convergence Theorem). Suppose the perceptron algorithm is run on a linearly separable data set D with margin γ > 0. Assume that ||x|| ≤ 1 for all x ∈ D. Then the algorithm will converge after at most 1/γ² updates.
The proof of this theorem is elementary, in the sense that it does
not use any fancy tricks: it’s all just algebra. The idea behind the
proof is as follows. If the data is linearly separable with margin γ,
then there exists some weight vector w∗ that achieves this margin.
Obviously we don’t know what w∗ is, but we know it exists. The
perceptron algorithm is trying to find a weight vector w that points
roughly in the same direction as w∗ . (For large γ, “roughly” can be
very rough. For small γ, “roughly” is quite precise.) Every time the
perceptron makes an update, the angle between w and w∗ changes.
What we prove is that the angle actually decreases. We show this in
two steps. First, the dot product w · w∗ increases a lot. Second, the
norm ||w|| does not increase very much. Since the dot product is
increasing, but w isn’t getting too long, the angle between them has
to be shrinking. The rest is algebra.
Thus, the squared norm of w⁽ᵏ⁾ increases by at most one every update. Therefore: ||w⁽ᵏ⁾||² ≤ k.

Now we put together the two things we have learned before. By our first conclusion, we know w* · w⁽ᵏ⁾ ≥ kγ. But our second conclusion gives √k ≥ ||w⁽ᵏ⁾||. Finally, because w* is a unit vector, we know that ||w⁽ᵏ⁾|| ≥ w* · w⁽ᵏ⁾. Putting this together, we have:

√k ≥ ||w⁽ᵏ⁾|| ≥ w* · w⁽ᵏ⁾ ≥ kγ    (3.16)

Taking the left-most and right-most terms, we get that √k ≥ kγ. Dividing both sides by k, we get 1/√k ≥ γ and therefore √k ≤ 1/γ. This means that once we've made 1/γ² updates, we cannot make any more!
Perhaps we don’t want to assume
It is important to keep in mind what this proof shows and what that all x have norm at most 1. If
they have all have norm at most
it does not show. It shows that if I give the perceptron data that
? R, you can achieve a very simi-
is linearly separable with margin γ > 0, then the perceptron will lar bound. Modify the perceptron
converge to a solution that separates the data. And it will converge convergence proof to handle this
case.
quickly when γ is large. It does not say anything about the solution,
other than the fact that it separates the data. In particular, the proof
makes use of the maximum margin separator. But the perceptron
is not guaranteed to find this maximum margin separator. The data
may be separable with margin 0.9 and the perceptron might still
find a separating hyperplane with a margin of only 0.000001. Later
(in Chapter ??), we will see algorithms that explicitly try to find the
maximum margin solution.

? Why does the perceptron convergence bound not contradict the earlier claim that poorly ordered data points (e.g., all positives followed by all negatives) will cause the perceptron to take an astronomically long time to learn?

3.6 Improved Generalization: Voting and Averaging

In the beginning of this chapter, there was a comment that the perceptron works amazingly well. This was a half-truth. The "vanilla"
The only difference between the voted prediction, Eq (??), and the
3.8 Exercises
4 | Practical Issues

Learning Objectives:
• Explain the relationship between the three learning techniques you have seen so far.
• Apply several debugging techniques to learning algorithms.

Dependencies: Chapter ??, Chapter ??, Chapter ??.

variants on things you already know. However, before attempting to understand more complex models of learning, it is important to have a firm grasp on how to use machine learning in practice. This chapter is all about how to go from an abstract learning problem to a concrete implementation. You will see some examples of "best practices" along with justifications of these practices.

In many ways, going from an abstract problem to a concrete learning task is more of an art than a science. However, this art can have
a huge impact on the practical performance of learning systems. In
many cases, moving to a more complicated learning algorithm will
gain you a few percent improvement. Going to a better representa-
tion will gain you an order of magnitude improvement. To this end,
we will discuss several high level ideas to help you develop a better
artistic sensibility.
One big difference between learning models is how robust they are to
the addition of noisy or irrelevant features. Intuitively, an irrelevant
feature is one that is completely uncorrelated with the prediction
task. A feature f whose expectation does not depend on the label
E[ f | Y ] = E[ f ] might be irrelevant. For instance, the presence of the word "the" might be largely irrelevant for predicting whether a course review is positive or negative.

Figure 4.3: prac:imageshape: object recognition in shapes
A secondary issue is how well these algorithms deal with redun-
dant features. Two features are redundant if they are highly cor-
related, regardless of whether they are correlated with the task or
not. For example, having a bright red pixel in an image at position
(20, 93) is probably highly redundant with having a bright red pixel
at position (21, 93). Both might be useful (e.g., for identifying fire hy-
drants), but because of how images are structured, these two features
are likely to co-occur frequently.
When thinking about robustness to irrelevant or redundant fea-
tures, it is usually not worthwhile thinking of the case where one has
999 great features and 1 bad feature. The interesting case is when the
bad features outnumber the good features, and often outnumber by
a large degree. For instance, perhaps the number of good features is
something like log D out of a set of D total features. The question is
how robust are algorithms in this case.¹

¹ You might think it's crazy to have so many irrelevant features, but the cases you've seen so far (bag of words, bag of pixels) are both reasonable examples of this! How many words, out of the entire English vocabulary (roughly 10,000–100,000 words), are actually useful for predicting positive and negative course reviews?

For shallow decision trees, the model explicitly selects features that are highly correlated with the label. In particular, by limiting the depth of the decision tree, one can at least hope that the model will be able to throw away irrelevant features. Redundant features are almost certainly thrown out: once you select one feature, the second feature now looks mostly useless. The only possible issue with irrelevant
of which are random independent coins, the chance that at least one
of them perfectly correlates is 0.5^(N−K). This suggests that if we have a sizeable number K of irrelevant features, we'd better have at least K + 21 training examples.
Unfortunately, the situation is actually worse than this. In the
above analysis we only considered the case of perfect correlation. We
could also consider the case of partial correlation, which would yield
even higher probabilities. (This is left as Exercise ?? for those who
want some practice with probabilistic analysis.) Suffice it to say that
even decision trees can become confused.
In the case of K-nearest neighbors, the situation is perhaps more
dire. Since KNN weighs each feature just as much as another feature,
the introduction of irrelevant features can completely mess up KNN
prediction. In fact, as you saw, in high dimensional space, randomly
distributed points all look approximately the same distance apart. If we add lots and lots of randomly distributed features to a data

Figure 4.5: prac:addirel: data from high dimensional warning, interpolated
two data sets that differ only in the norm of the feature vectors (i.e.,
one is just a scaled version of the other), it is difficult to compare the
learned models. Example normalization makes this more straightfor-
ward. Moreover, as you saw in the perceptron convergence proof, it is
often just mathematically easier to assume normalized data.
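As a concrete sketch of example normalization (one common convention is to scale every feature vector to unit Euclidean norm; the function name is illustrative):

import math

def normalize_example(x):
    # scale a feature vector to unit Euclidean norm; leave zero vectors alone
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)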
and a set of meta features that you might extract from it. There is a
hyperparameter here of what length paths to extract from the tree: in
this case, only paths of length two are extracted. For bigger trees, or
if you have more data, you might benefit from longer paths.
In addition to combinatorial transformations, the logarithmic
transformation can be quite useful in practice. It seems like a strange
thing to be useful, since it doesn’t seem to fundamentally change
the data. However, since many learning algorithms operate by linear
operations on the features (both perceptron and KNN do this), the
log-transform is a way to get product-like operations. The question is
which of the following feels more applicable to your data: (1) every
time this feature increases by one, I'm equally more likely to predict a positive label; (2) every time this feature doubles, I'm equally more likely to predict a positive label. In the first case, you should stick with linear features and in the second case you should switch to a log-transform. This is an important transformation in text data, where the presence of the word "excellent" once is a good indicator of a positive review; seeing "excellent" twice is a better indicator; but the difference between seeing "excellent" 10 times and seeing it 11 times really isn't a big deal any more. A log-transform achieves

Figure 4.13: prac:log: performance on text categ with word counts versus log word counts
So far, our focus has been on classifiers that achieve high accuracy.
In some cases, this is not what you might want. For instance, if you
are trying to predict whether a patient has cancer or not, it might be
better to err on one side (saying they have cancer when they don’t)
than the other (because then they die). Similarly, letting a little spam
slip through might be better than accidentally blocking one email
from your boss.
There are two major types of binary classification problems. One
is “X versus Y.” For instance, positive versus negative sentiment.
Another is “X versus not-X.” For instance, spam versus non-spam.
(The argument being that there are lots of types of non-spam.) Or
in the context of web search, relevant document versus irrelevant
document. This is a subtle and subjective decision. But “X versus not-
X” problems often have more of the feel of “X spotting” rather than
a true distinction between X and Y. (Can you spot the spam? can you
spot the relevant documents?)
For spotting problems (X versus not-X), there are often more ap-
propriate success metrics than accuracy. A very popular one from
information retrieval is the precision/recall metric. Precision asks
the question: of all the X’s that you found, how many of them were
actually X’s? Recall asks: of all the X’s that were out there, how many
of them did you find?⁴ Formally, precision and recall are defined as:

P = I / S    (4.8)
R = I / T    (4.9)

S = number of Xs that your system found    (4.10)
T = number of Xs in the data    (4.11)
I = number of correct Xs that your system found    (4.12)

⁴ A colleague makes the analogy to the US court system's saying "Do you promise to tell the whole truth and nothing but the truth?" In this case, the "whole truth" means high recall and "nothing but the truth" means high precision.
F_β = (1 + β²) × P × R / (β² × P + R)    (4.14)

mance for a user who cares β times as much about precision as about recall.
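A small helper that computes these quantities from the set of items a system labeled X and the set of true Xs (an illustrative sketch):

def precision_recall_f(found, true_xs, beta=1.0):
    # found, true_xs: sets of items; returns (P, R, F_beta) per Eqs (4.8)-(4.14)
    S, T = len(found), len(true_xs)
    I = len(found & true_xs)                 # correct Xs that the system found
    P = I / S if S else 0.0
    R = I / T if T else 0.0
    denom = beta ** 2 * P + R
    F = (1 + beta ** 2) * P * R / denom if denom else 0.0
    return P, R, F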
One thing to keep in mind is that precision and recall (and hence
f-measure) depend crucially on which class is considered the thing
you wish to find. In particular, if you take a binary data set and flip what it means to be a positive or negative example, you will end up with completely different precision and recall values. It is not
the case that precision on the flipped task is equal to recall on the
original task (nor vice versa). Consequently, f-measure is also not the
same. For some tasks where people are less sure about what they
want, they will occasionally report two sets of precision/recall/f-
measure numbers, which vary based on which class is considered the
thing to spot.
There are other standard metrics that are used in different com-
munities. For instance, the medical community is fond of the sensi-
tivity/specificity metric. A sensitive classifier is one which almost
always finds everything it is looking for: it has high recall. In fact,
sensitivity is exactly the same as recall. A specific classifier is one
which does a good job not finding the things that it doesn’t want to
find. Specificity is precision on the negation of the task at hand.
You can compute curves for sensitivity and specificity much like
those for precision and recall. The typical plot, referred to as the re-
ceiver operating characteristic (or ROC curve) plots the sensitivity
against 1 − specificity. Given an ROC curve, you can compute the
area under the curve (or AUC) metric, which also provides a mean-
ingful single number for a system’s performance. Unlike f-measures,
which tend to be low because they require agreement, AUC scores
tend to be very high, even for not great systems. This is because ran-
dom chance will give you an AUC of 0.5 and the best possible AUC
is 1.0.
The main message for evaluation metrics is that you should choose
whichever one makes the most sense. In many cases, several might
make sense. In that case, you should do whatever is more commonly
done in your field. There is no reason to be an outlier without cause.
Algorithm 9 KNN-Train-LOO(D)
1: errk ← 0, ∀ 1 ≤ k ≤ N − 1    // errk stores how well you do with kNN
2: for n = 1 to N do
   . . .
14: return argmin_k errk    // return the K that achieved lowest error
Algorithm ??.
Overall, the main advantage to cross validation over develop-
ment data is robustness. The main advantage of development data is
speed.
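For concreteness, a Python sketch of leave-one-out cross validation for choosing K in KNN, in the spirit of Algorithm 9; it reuses the knn_predict sketch from the previous chapter and is illustrative, not the book's reference code.

def knn_train_loo(data, max_k):
    # data: list of (vector, label) pairs; returns the K with lowest LOO error
    err = {k: 0 for k in range(1, max_k + 1)}
    for n, (x, y) in enumerate(data):
        rest = data[:n] + data[n + 1:]           # hold out example n
        for k in err:
            if knn_predict(rest, x, k) != y:
                err[k] += 1
    return min(err, key=err.get)                 # K that achieved lowest error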
One warning to keep in mind is that the goal of both cross valida-
tion and development data is to estimate how well you will do in the
future. This is a question of statistics, and holds only if your test data
really looks like your training data. That is, it is drawn from the same
distribution. In many practical cases, this is not entirely true.
For example, in person identification, we might try to classify
every pixel in an image based on whether it contains a person or not.
If we have 100 training images, each with 10,000 pixels, then we have a total of 1m training examples. The classification for a pixel in image
5 is highly dependent on the classification for a neighboring pixel in
the same image. So if one of those pixels happens to fall in training
data, and the other in development (or cross validation) data, your
model will do unreasonably well. In this case, it is important that
when you cross validate (or use development data), you do so over
images, not over pixels. The same goes for text problems where you
sometimes want to classify things at a word level, but are handed a
collection of documents. The important thing to keep in mind is that
it is the images (or documents) that are drawn independently from
your data distribution and not the pixels (or words), which are drawn
dependently.
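To make the “split by images, not by pixels” advice concrete, here is a minimal sketch in Python; the (image_id, features, label) representation and the function name are assumptions made only for illustration:
import random

def split_by_group(examples, dev_fraction=0.2, seed=0):
    # examples: list of (image_id, features, label); all pixels from one image share an image_id
    groups = sorted({image_id for image_id, _, _ in examples})
    random.Random(seed).shuffle(groups)
    n_dev = max(1, int(dev_fraction * len(groups)))
    dev_groups = set(groups[:n_dev])
    train = [ex for ex in examples if ex[0] not in dev_groups]
    dev = [ex for ex in examples if ex[0] in dev_groups]
    return train, dev
The same sketch works for documents and words: make the group identifier the document, not the word.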
4.9 Exercises
Your boss tells you to build a classifier that can identify fraudulent
transactions in credit card histories. Fortunately, most transactions
are legitimate, so perhaps only 0.1% of the data is a positive in-
stance. The imbalanced data problem refers to the fact that for a
large number of real world problems, the number of positive exam-
ples is dwarfed by the number of negative examples (or vice versa).
This is actually something of a misnomer: it is not the data that is
imbalanced, but the distribution from which the data is drawn. (And
since the distribution is imbalanced, so must the data be.)
Imbalanced data is a problem because machine learning algo-
rithms are too smart for your own good. For most learning algo-
rithms, if you give them data that is 99.9% negative and 0.1% posi-
tive, they will simply learn to always predict negative. Why? Because
they are trying to minimize error, and they can achieve 0.1% error by
doing nothing! If a teacher told you to study for an exam with 1000
true/false questions and only one of them is true, it is unlikely you
will study very long.
Really, the problem is not with the data, but rather with the way
that you have defined the learning problem. That is to say, what you
care about is not accuracy: you care about something else. If you
want a learning algorithm to do a reasonable job, you have to tell it
what you want!
Most likely, what you want is not to optimize accuracy, but rather
to optimize some other measure, like f-score or AUC. You want your
algorithm to make some positive predictions, and simply prefer those
to be “good.” We will shortly discuss two heuristics for dealing with
this problem: subsampling and weighting. In subsampling, you throw
out some of your negative examples so that you are left with a bal-
anced data set (50% positive, 50% negative). This might scare you
a bit since throwing out data seems like a bad idea, but at least it
makes learning much more efficient. In weighting, instead of throwing
out negative examples, we just give them lower weight. If you
assign an importance weight of 0.00101 to each of the negative ex-
amples, then there will be as much weight associated with positive
examples as negative examples.
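As a concrete illustration of the two heuristics, here is a minimal sketch in Python; it assumes the data is a list of (x, y) pairs with y in {+1, −1}, an assumption made only for this example:
import random

def subsample_negatives(data, seed=0):
    pos = [ex for ex in data if ex[1] == +1]
    neg = [ex for ex in data if ex[1] == -1]
    neg = random.Random(seed).sample(neg, k=min(len(neg), len(pos)))  # keep as many negatives as positives
    return pos + neg

def weight_negatives(data):
    pos = sum(1 for _, y in data if y == +1)
    neg = sum(1 for _, y in data if y == -1)
    alpha = pos / neg  # roughly 0.00101 when 0.1% of the data is positive
    return [(x, y, 1.0 if y == +1 else alpha) for x, y in data]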
Before formally defining these heuristics, we need to have a mech-
anism for formally defining supervised learning problems. We will
proceed by example, using binary classification as the canonical
learning problem.
1. An input space X
This theorem states that if your binary classifier does well (on the
induced distribution), then the learned predictor will also do well
(on the original distribution). Thus, we have successfully converted
a weighted learning problem into a plain classification problem! The
fact that the error rate of the weighted predictor is exactly α times
more than that of the unweighted predictor is unavoidable: the error
metric on which it is evaluated is α times bigger! (Why is it unreasonable
to expect to be able to achieve, for instance, an error of √α e, or anything that
is sublinear in α?)
The proof of this theorem is so straightforward that we will prove
it here. It simply involves some algebra on expected values.
= ∑x∈X ∑y∈±1 D^w(x, y) α^[y=+1] [f(x) ≠ y]                                      (5.2)
= α ∑x∈X ( D^w(x, +1) [f(x) ≠ +1] + (1/α) D^w(x, −1) [f(x) ≠ −1] )              (5.3)
= α ∑x∈X ( D^b(x, +1) [f(x) ≠ +1] + D^b(x, −1) [f(x) ≠ −1] )                    (5.4)
= α E(x,y)∼D^b [f(x) ≠ y]                                                        (5.5)
= α e^b                                                                          (5.6)
5: return f 1 , . . . , f K
lem. A very common approach is the one versus all technique (also
called OVA or one versus rest). To perform OVA, you train K-many
binary classifiers, f 1 , . . . , f K . Each classifier sees all of the training
data. Classifier f i receives all examples labeled class i as positives
and all other examples as negatives. At test time, whichever classifier
predicts “positive” wins, with ties broken randomly. (Suppose that you
have N data points in K classes, evenly divided. How long does it take to train
an OVA classifier, if the base binary classifier takes O(N) time to train? What
if the base classifier takes O(N²) time?)
The training and test algorithms for OVA are sketched in Algo-
rithms 5.2 and 5.2. In the testing procedure, the prediction of the ith
classifier is added to the overall score for class i. Thus, if the predic-
tion is positive, class i gets a vote; if the prediction is negative, every-
one else (implicitly) gets a vote. (In fact, if your learning algorithm
can output a confidence, as discussed in Section ??, you can often do
better by using the confidence as y, rather than a simple ±1.)
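A minimal sketch of OVA training and testing follows; it assumes you can supply a function train_binary(X, y) that returns a scoring function whose sign is the prediction — both names are placeholders, not a particular library’s API:
import random

def ova_train(X, Y, num_classes, train_binary):
    classifiers = []
    for i in range(num_classes):
        y_binary = [+1 if y == i else -1 for y in Y]   # class i positive, everything else negative
        classifiers.append(train_binary(X, y_binary))
    return classifiers

def ova_predict(classifiers, x):
    scores = [0.0] * len(classifiers)
    for i, f in enumerate(classifiers):
        score = f(x)
        if score > 0:
            scores[i] += score                      # class i gets a (confidence-weighted) vote
        else:
            for j in range(len(scores)):
                if j != i:
                    scores[j] += -score             # everyone else implicitly gets a vote
    best = max(scores)
    return random.choice([i for i, s in enumerate(scores) if s == best])  # break ties randomly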
OVA is natural and easy to implement. It also
works very well in practice, so long as you do a good job choosing
a good binary classification algorithm and tuning its hyperparameters
well. (Why would using a confidence help?) Its weakness is that it can
be somewhat brittle. Intuitively, it is
not particularly robust to errors in the underlying classifiers. If one
classifier makes a mistake, it is possible that the entire prediction is
erroneous. In fact, it is entirely possible that none of the K classifiers
predicts positive (which is actually the worst-case scenario from a
theoretical perspective)! This is made explicit in the OVA error bound
below.
Theorem 3 (OVA Error Bound). Suppose the average binary error of the
K binary classifiers is e. Then the error rate of the OVA multiclass predictor
is at most (K − 1)e.
2: for i = 1 to K-1 do
the (K choose 2) binary classifiers is e. Then the error rate of the AVA multiclass
predictor is at most 2(K − 1)e.
(The bound for AVA is 2(K − 1)e; the bound for OVA is (K − 1)e. Does this
mean that OVA is necessarily better than AVA? Why or why not?)
At this point, you might be wondering if it’s possible to do bet-
ter than something linear in K. Fortunately, the answer is yes! The
solution, like so much in computer science, is divide and conquer.
The idea is to construct a binary tree of classifiers. The leaves of this
tree correspond to the K labels. Since there are only log2 K decisions
made to get from the root to a leaf, then there are only log2 K chances
to make an error.
An example of a classification tree for K = 8 classes is shown in
Figure 5.2. At the root, you distinguish between classes {1, 2, 3, 4}
and classes {5, 6, 7, 8}. This means that you will train a binary clas-
sifier whose positive examples are all data points with multiclass
label {1, 2, 3, 4} and whose negative examples are all data points with
multiclass label {5, 6, 7, 8}. Based on what decision is made by this
classifier, you can walk down the appropriate path in the tree. When
K is not a power of 2, the tree will not be full. This classification tree
algorithm achieves the following bound.
the path from the root to the correct leaf makes an error. Each has
probability e of making an error and the path consists of at most
⌈log2 K⌉ binary decisions.
One thing to keep in mind with tree classifiers is that you have
control over how the tree is defined. In OVA and AVA you have no
say in what classification problems are created. In tree classifiers,
the only thing that matters is that, at the root, half of the classes are
considered positive and half are considered negative. You want to
split the classes in such a way that this classification decision is as
easy as possible. You can use whatever you happen to know about
your classification problem to try to separate the classes out in a
reasonable way.
Can you do better than ⌈log2 K⌉ e? It turns out the answer is yes,
but the algorithms to do so are relatively complicated. You can actu-
ally do as well as 2e using the idea of error-correcting tournaments.
Moreover, you can prove a lower bound that states that the best you
could possibly do is e/2. This means that error-correcting tourna-
ments are at most a factor of four worse than optimal.
5.3 Ranking
You start a new web search company called Goohooing. Like other
search engines, a user inputs a query and a set of documents is re-
trieved. Your goal is to rank the resulting documents based on rel-
evance to the query. The ranking problem is to take a collection of
items and sort them according to some notion of preference. One of
the trickiest parts of doing ranking through learning is to properly
define the loss function. Toward the end of this section you will see a
very general loss function, but before that let’s consider a few special
cases.
Continuing the web search example, you are given a collection of
queries. For each query, you are also given a collection of documents,
together with a desired ranking over those documents. In the follow-
ing, we’ll assume that you have N-many queries and for each query
you have M-many documents. (In practice, M will probably vary
by query, but for ease we’ll consider the simplified case.) The goal is
to train a binary classifier to predict a preference function. Given a
query q and two documents di and d j , the classifier should predict
whether di should be preferred to d j with respect to the query q.
As in all the previous examples, there are two things we have to
take care of: (1) how to train the classifier that predicts preferences;
(2) how to turn the predicted preferences into a ranking. Unlike the
previous examples, the second step is somewhat complicated in the
TASK: ω-RANKING
Given:
1. An input space X
where σ̂ = f ( x)
In this definition, the only complex aspect is the loss function 5.7.
This loss sums over all pairs of objects u and v. If the true ranking (σ)
You are writing new software for a digital camera that does face
identification. However, instead of simply finding a bounding box
around faces in an image, you must predict where a face is at the
pixel level. So your input is an image (say, 100×100 pixels: this is a
really low resolution camera!) and your output is a set of 100×100
binary predictions about each pixel. You are given a large collection
Figure 5.3: example face finding image and pixel mask
Given:
Algorithm 21 StackTest( f 1 , . . . , f K , G)
1: Ŷk,v ← 0, ∀ k ∈ [ K ], v ∈ G // initialize predictions for all levels
2: for k = 1 to K do
3: for all v ∈ G do
4: x ← features for node v
5: x ← x ⊕ Ŷl,u , ∀u ∈ N (v), ∀l ∈ [k − 1] // add on features for
6: // neighboring nodes from lower levels in the stack
7: Ŷk,v ← f k (x) // predict according to kth level
8: end for
9: end for
10: return {ŶK,v : v ∈ G } // return predictions for every node from the last layer
5.5 Exercises
The essence of mathematics is not to make simple things compli-
cated, but to make complicated things simple. – Stanley Gudder
Learning Objectives:
• Define and plot four surrogate loss functions: squared loss, logistic loss, exponential loss and hinge loss.
• Compare and contrast the optimization of 0/1 loss and surrogate loss functions.
• Solve the optimization problem for squared loss with a quadratic regularizer in closed form.
• Implement and debug gradient descent and subgradient descent.
In Chapter ??, you learned about the perceptron algorithm
for linear classification. This was both a model (linear classifier) and
algorithm (the perceptron update rule) in one. In this section, we
will separate these two, and consider general ways for optimizing
linear models. This will lead us into some aspects of optimization
(aka mathematical programming), but not very far. At the end of
this chapter, there are pointers to more literature on optimization for
those who are interested.
The basic idea of the perceptron is to run a particular algorithm
until a linear separator is found. You might ask: are there better al-
gorithms for finding such a linear separator? We will follow this idea
and formulate a learning problem as an explicit optimization prob-
lem: find me a linear separator that is not too complicated. We will
see that finding an “optimal” separator is actually computationally
prohibitive, and so will need to “relax” the optimality requirement.
This will lead us to a convex objective that combines a loss func-
tion (how well are we doing on the training data?) and a regularizer
(how complicated is our learned model?). This learning framework
is known as both Tikhonov regularization and structural risk mini-
mization.
Dependencies:
rable. In the case that your training data isn’t linearly separable, you
might want to find the hyperplane that makes the fewest errors on
the training data. We can write this down as a formal mathematical
optimization problem as follows:
min_{w,b} ∑n 1[yn (w · xn + b) ≤ 0]                                   (6.1)
min_{w,b} ∑n 1[yn (w · xn + b) ≤ 0] + λR(w, b)                        (6.2)
In the definition of logistic loss, the 1/log 2 term out front is there sim-
ply to ensure that ℓ(log)(y, 0) = 1. This ensures, like all the other
surrogate loss functions, that logistic loss upper bounds the zero/one
loss. (In practice, people typically omit this constant since it does not
affect the optimization.)
There are two big differences in these loss functions. The first
difference is how “upset” they get by erroneous predictions. In the
case of hinge loss and logistic loss, the growth of the function as ŷ
goes negative is linear. For squared loss and exponential loss, it is
super-linear. This means that exponential loss would rather get a few
examples a little wrong than one example really wrong. The other
difference is how they deal with very confident correct predictions.
Once yŷ > 1, hinge loss does not care any more, but logistic and
exponential still think you can do better. On the other hand, squared
loss thinks it’s just as bad to predict +3 on a positive example as it is
to predict −1 on a positive example.
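To make these comparisons easy to reproduce, here is a minimal sketch of the losses written as functions of the margin m = y·ŷ, using the 1/log 2 scaling of logistic loss mentioned earlier; plotting them over a range of m shows exactly the behaviors described above:
import math

def zero_one_loss(m):    return 1.0 if m <= 0 else 0.0
def hinge_loss(m):       return max(0.0, 1.0 - m)
def logistic_loss(m):    return (1.0 / math.log(2)) * math.log(1.0 + math.exp(-m))
def exponential_loss(m): return math.exp(-m)
def squared_loss(m):     return (1.0 - m) ** 2

# At m = 0 every surrogate takes the value 1, matching the zero/one loss there.
print([f(0.0) for f in (hinge_loss, logistic_loss, exponential_loss, squared_loss)])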
min_{w,b} ∑n ℓ(yn , w · xn + b) + λR(w, b)                            (6.8)
||w||_p = ( ∑d |wd|^p )^(1/p)                                         (6.10)
You can check that the 2-norm exactly corresponds to the usual Eu-
clidean norm, and that the 1-norm corresponds to the “absolute”
regularizer described above.
(You can actually identify the R(cnt) regularizer with a p-norm as well. Which
value of p gives it to you? Hint: you may have to take a limit.)
When p-norms are used to regularize weight vectors, the interest-
ing aspect is how they trade off multiple features. To see the behavior
of p-norms in two dimensions, we can plot their contour (or level-
set). Figure 6.5 shows the contours for the same p-norms in two
dimensions. Each line denotes the two-dimensional vectors to which
this norm assigns a total value of 1. By changing the value of p, you
can interpolate between a square (the so-called “max norm”), down
to a circle (2-norm), diamond (1-norm) and pointy-star-shaped-thing
(p < 1 norm).
In general, smaller values of p “prefer” sparser vectors. You can
see this by noticing that the contours of small p-norms “stretch”
out along the axes. It is for this reason that small p-norms tend to
yield weight vectors with many zero entries (aka sparse weight vec-
tors). Unfortunately, for p < 1 the norm becomes non-convex. As
you might guess, this means that the 1-norm is a popular choice for
sparsity-seeking applications.
Algorithm 22 GradientDescent(F , K, η1 , . . . )
1: z(0) ← h0, 0, . . . , 0i // initialize variable we are optimizing
2: for k = 1 . . . K do
3: g (k) ← ∇z F |z(k-1) // compute gradient at current location
4: z(k) ← z(k-1) − η (k) g (k) // take a step down the gradient
5: end for
6: return z(K)
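A direct Python transliteration of this algorithm, assuming you supply the gradient of F as a function and use a fixed step size η, is the following sketch:
import numpy as np

def gradient_descent(grad, dim, num_steps, eta):
    z = np.zeros(dim)               # initialize the variable we are optimizing
    for _ in range(num_steps):
        z = z - eta * grad(z)       # take a step down the gradient
    return z

# Example: minimize F(z) = ||z - 3||^2, whose gradient is 2(z - 3); the minimum is at z = 3.
z_star = gradient_descent(lambda z: 2 * (z - 3.0), dim=2, num_steps=200, eta=0.1)
print(z_star)                       # close to [3. 3.]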
∂L/∂b = ∂/∂b ∑n exp[ − yn (w · xn + b) ] + ∂/∂b (λ/2) ||w||²          (6.12)
      = ∑n ∂/∂b exp[ − yn (w · xn + b) ] + 0                          (6.13)
      = ∑n [ ∂/∂b ( − yn (w · xn + b) ) ] exp[ − yn (w · xn + b) ]    (6.14)
      = − ∑n yn exp[ − yn (w · xn + b) ]                              (6.15)
Now you can repeat the previous exercise. The update is of the form
w ← w − η∇w L. For well classified points (ones for which yn (w · xn + b)
tends toward +∞), the gradient is near zero. For poorly classified points,
If you plug this subgradient form into Algorithm 6.4, you obtain
Algorithm 6.5. This is the subgradient descent for regularized hinge
loss (with a 2-norm regularizer).
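Although that algorithm is not reproduced here, the following is a minimal sketch of what subgradient descent for the regularized hinge loss can look like in Python; the step size, epoch count and the choice of subgradient at the kink are illustrative assumptions:
import numpy as np

def hinge_subgradient_descent(X, y, lam=0.1, eta=0.01, epochs=100):
    # X is an N x D numpy array, y a vector of +1/-1 labels.
    # Objective: sum_n max(0, 1 - y_n (w.x_n + b)) + (lam / 2) ||w||^2
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        gw, gb = lam * w, 0.0                    # subgradient of the regularizer
        for n in range(N):
            if y[n] * (X[n] @ w + b) < 1:        # margin violated: hinge term is active
                gw -= y[n] * X[n]
                gb -= y[n]
        w, b = w - eta * gw, b - eta * gb
    return w, b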
says that we should minimize ½ ∑n (Ŷn − Yn )², which can be written
in vector form as a minimization of ½ ||Ŷ − Y||².
(Verify that the squared error can actually be written as this vector norm.)
This can be expanded visually as:
[ x1,1  x1,2  . . .  x1,D ]   [ w1 ]     [ ∑d x1,d wd ]     [ y1 ]
[ x2,1  x2,2  . . .  x2,D ]   [ w2 ]     [ ∑d x2,d wd ]     [ y2 ]
[  ..     ..   . . .   .. ] · [  .. ]  =  [     ..     ]  ≈  [  .. ]
[ xN,1  xN,2  . . .  xN,D ]   [ wD ]     [ ∑d xN,d wd ]     [ yN ]
          X                     w              Ŷ              Y       (6.27)
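The learning objectives mention solving squared loss with a quadratic regularizer in closed form. As a hedged sketch (the placement of constants such as ½ and λ is a convention and may differ from the derivation in the text), the familiar solution w = (X⊤X + λI)⁻¹X⊤Y can be computed as:
import numpy as np

def ridge_closed_form(X, Y, lam=1.0):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0])
print(ridge_closed_form(X, Y, lam=0.1))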
At the beginning of this chapter, you may have looked at the convex
surrogate loss functions and asked yourself: where did these come
from?! They are all derived from different underlying principles,
which essentially correspond to different inductive biases.
Let’s start by thinking back to the original goal of linear classifiers:
to find a hyperplane that separates the positive training examples
from the negative ones. Figure 6.10 shows some data and three po-
tential hyperplanes: red, green and blue. Which one do you like best?
Most likely you chose the green hyperplane. And most likely you
chose it because it was furthest away from the closest training points.
Figure 6.10: picture of data points with three hyperplanes, RGB with G the best
In other words, it had a large margin. The desire for hyperplanes
with large margins is a perfect example of an inductive bias. The data
does not tell us which of the three hyperplanes is best: we have to
choose one using some other source of information.
Following this line of thinking leads us to the support vector ma-
chine (SVM). This is simply a way of setting up an optimization
problem that attempts to find a separating hyperplane with as large
a margin as possible. It is written as a constrained optimization
problem:
min_{w,b} 1/γ(w, b)                                                    (6.35)
subj. to yn (w · xn + b) ≥ 1 (∀n)
In words: maximize the margin γ(w, b), subject to the constraint that all
training examples are correctly classified.
The “odd” thing about this optimization problem is that we re-
quire the classification of each point to be greater than one rather than
simply greater than zero. However, the problem doesn’t fundamen-
tally change if you replace the “1” with any other positive constant
(see Exercise ??). As shown in Figure 6.11, the constant one can be
interpreted visually as ensuring that there is a non-trivial margin
between the positive points and negative points.
The difficulty with the optimization problem in Eq (??) is what
happens with data that is not linearly separable. In that case, there
is no set of parameters w, b that can simultaneously satisfy all the
constraints. In optimization terms, you would say that the feasible
region is empty. (The feasible region is simply the set of all parame-
ters that satisfy the constraints.) For this reason, this is referred to as
the hard-margin SVM, because enforcing the margin is a hard con-
straint. The question is: how to modify this optimization problem so
that it can handle inseparable data.
Figure 6.11: hyperplane with margins on sides
The key idea is the use of slack parameters. The intuition behind
slack parameters is the following. Suppose we find a set of param-
eters w, b that do a really good job on 9999 data points. The points
are perfectly classified and you achieve a large margin. But there’s
one pesky data point left that cannot be put on the proper side of the
margin: perhaps it is noisy. (See Figure 6.12.) You want to be able
to pretend that you can “move” that point across the hyperplane on
to the proper side. You will have to pay a little bit to do so, but as
long as you aren’t moving a lot of points around, it should be a good
idea to do this. In this picture, the amount that you move the point is
denoted ξ (xi).
Figure 6.12: one bad point with slack
By introducing one slack parameter for each training example,
and penalizing yourself for having to use slack, you can create an
objective function like the following, soft-margin SVM:
min_{w,b,ξ} 1/γ(w, b) + C ∑n ξn                                        (6.36)
            (large margin) (small slack)
subj. to yn (w · xn + b) ≥ 1 − ξn (∀n)
         ξn ≥ 0 (∀n)
The goal of this objective function is to ensure that all points are
correctly classified (the first constraint). But if a point n cannot be
correctly classified, then you can set the slack ξ n to something greater
than zero to “move” it in the correct direction. However, for all non-
zero slacks, you have to pay in the objective function proportional to
the amount of slack. The hyperparameter C > 0 controls overfitting
100 a course in machine learning
versus underfitting. The second constraint simply says that you must
not have negative slack. (What values of C will lead to overfitting? What
values will lead to underfitting?)
One major advantage of the soft-margin SVM over the original
hard-margin SVM is that the feasible region is never empty. That is,
there is always going to be some solution, regardless of whether your
training data is linearly separable or not.
It’s one thing to write down an optimization problem. It’s another
thing to try to solve it. There are a very large number of ways to
optimize SVMs, essentially because they are such a popular learning
model. Here, we will talk just about one, very simple way. More
complex methods will be discussed later in this book once you have a
bit more background. (Suppose I give you a data set. Without even looking
at the data, construct for me a feasible solution to the soft-margin SVM. What
is the value of the objective for this solution?)
To make progress, you need to be able to measure the size of the
margin. Suppose someone gives you parameters w, b that optimize
the hard-margin SVM. We wish to measure the size of the margin.
The first observation is that the hyperplane will lie exactly halfway
between the nearest positive point and nearest negative point. If not,
the margin could be made bigger by simply sliding it one way or the
other by adjusting the bias b.
By this observation, there is some positive example that lies
exactly 1 unit from the hyperplane. Call it x+ , so that w · x+ + b = 1.
Similarly, there is some negative example, x− , that lies exactly on
the other side of the margin: for which w · x− + b = −1. These two
points, x+ and x− give us a way to measure the size of the margin.
As shown in Figure 6.11, we can measure the size of the margin by
looking at the difference between the lengths of projections of x+
and x− onto the hyperplane. Since projection requires a normalized
vector, we can measure the distances as:
d+ = (1/||w||) (w · x+ + b)                                            (6.37)
d− = (1/||w||) (w · x− + b)                                            (6.38)
γ = (d+ − d−) / 2                                                      (6.39)
  = ½ [ (1/||w||)(w · x+ + b) − (1/||w||)(w · x− + b) ]                (6.40)
  = (1/(2||w||)) [ (w · x+ + b) − (w · x− + b) ]                       (6.41)
  = (1/(2||w||)) [ (+1) − (−1) ]                                       (6.42)
  = 1/||w||                                                            (6.43)
Figure 6.13: copy of figure from p5 of cs544 svm tutorial
6.8 Exercises
The Bayes error rate (or Bayes optimal error rate) is the error rate
of the Bayes optimal classifier. It is the best error rate you can ever
hope to achieve on this classification problem (under zero/one loss).
The take-home message is that if someone gave you access to
the data distribution, forming an optimal classifier would be trivial.
Unfortunately, no one gave you this distribution, but this analysis
suggests that a good way to build a classifier is to try to estimate D. In
other words, you try to learn a distribution D̂, which you hope is
very similar to D, and then use this distribution for classification. Just
as in the preceding chapters, you can try to form your estimate of D
based on a finite training set.
The most direct way that you can attempt to construct such a
probability distribution is to select a family of parametric distribu-
tions. For instance, a Gaussian (or Normal) distribution is parametric:
its parameters are its mean and covariance. The job of learning is
then to infer which parameters are “best” as far as the observed train-
ing data is concerned, as well as whatever inductive bias you bring.
A key assumption that you will need to make is that the training data
you have access to is drawn independently from D . In particular, as
you draw examples ( x1 , y1 ) ∼ D then ( x2 , y2 ) ∼ D and so on, the
nth draw ( xn , yn ) is drawn from D and does not otherwise depend
Suppose you need to model a coin that is possibly biased (you can
think of this as modeling the label in a binary classification problem),
and that you observe data HHTH (where H means a flip came up heads
and T means it came up tails). You can assume that all the flips came
from the same coin, and that each flip was independent (hence, the
data was i.i.d.). Further, you may choose to believe that the coin has
a fixed probability β of coming up heads (and hence 1 − β of coming
up tails). Thus, the parameter of your model is simply the scalar β.
(Describe a case in which at least one of the assumptions we are making about
the coin flip is false.)
The most basic computation you might perform is maximum like-
lihood estimation: namely, select the parameter β that maximizes the
probability of the data under that parameter. In order to do so, you
need to compute the probability of the data:
∂/∂β [ β³ − β⁴ ] = 3β² − 4β³                                           (7.7)
4β³ = 3β²                                                              (7.8)
⇐⇒ 4β = 3                                                              (7.9)
⇐⇒ β = 3/4                                                             (7.10)
Thus, the maximum likelihood β is 0.75, which is probably what
you would have selected by intuition. You can solve this problem
more generally as follows. If you have H-many heads and T-many
tails, the probability of your data sequence is β H (1 − β) T . You can
try to take the derivative of this with respect to β and follow the
same recipe, but all of the products make things difficult. A more
Now, consider the binary classification problem. You are looking for
a parameterized probability distribution that can describe the training
data you have. To be concrete, your task might be to predict whether
a movie review is positive or negative (label) based on what words
(features) appear in that review. Thus, the probability for a single data
point can be written as:
pθ (x1 , x2 , . . . , xD , y) = pθ (y) pθ (x1 | y) pθ (x2 | y, x1 ) pθ (x3 | y, x1 , x2 )
                               · · · pθ (xD | y, x1 , x2 , . . . , xD−1 )             (7.14)
                             = pθ (y) ∏d pθ (xd | y, x1 , . . . , xd−1 )              (7.15)
don’t know the label—they most certainly are not.) Formally this
assumption states that:
Solving for θ0 is identical to solving for the biased coin case from
before: it is just the relative frequency of positive labels in your data
(because θ0 doesn’t depend on x at all). For the other parameters,
you can repeat the same exercise as before for each of the 2D coins
independently. This yields:
θ̂0 = (1/N) ∑n [yn = +1]                                               (7.20)
θ̂(+1),d = ∑n [yn = +1 ∧ xn,d = 1] / ∑n [yn = +1]                       (7.21)
θ̂(−1),d = ∑n [yn = −1 ∧ xn,d = 1] / ∑n [yn = −1]                       (7.22)
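These estimates are easy to compute directly from data. Here is a minimal sketch, assuming X is an N×D array of binary features and y a vector of ±1 labels; the smoothing argument is an extra convenience (not part of Eqs (7.20)–(7.22)) used to avoid zero probabilities:
import numpy as np

def naive_bayes_fit(X, y, smoothing=0.0):
    pos, neg = (y == +1), (y == -1)
    theta0 = pos.mean()                                                          # Eq (7.20)
    theta_pos = (X[pos].sum(axis=0) + smoothing) / (pos.sum() + 2 * smoothing)   # Eq (7.21)
    theta_neg = (X[neg].sum(axis=0) + smoothing) / (neg.sum() + 2 * smoothing)   # Eq (7.22)
    return theta0, theta_pos, theta_neg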
In the case that the features are not binary, you need to choose a dif-
ferent model for p(xd | y). The model we chose here is the Bernoulli
distribution, which is effectively a distribution over independent
coin flips. For other types of data, other distributions become more
appropriate. The die example from before corresponds to a discrete
distribution. If the data is continuous, you might choose to use a
Gaussian distribution (aka Normal distribution). The choice of dis-
tribution is a form of inductive bias by which you can inject your
knowledge of the problem into the learning algorithm.
7.4 Prediction
Consider the predictions made by the naive Bayes model with Bernoulli
features in Eq (7.18). You can better understand this model by con-
sidering its decision boundary. In the case of probabilistic models,
the decision boundary is the set of inputs for which the likelihood of
y = +1 is precisely 50%. Or, in other words, the set of inputs x for
which p(y = +1 | x)/p(y = −1 | x) = 1. In order to do this, the
first thing to notice is that p(y | x) = p(y, x)/p( x). In the ratio, the
p( x) terms cancel, leaving p(y = +1, x)/p(y = −1, x). Instead of
computing this ratio, it is easier to compute the log-likelihood ratio
(or LLR), log p(y = +1, x) − log p(y = −1, x), computed below:
" # " #
LLR = log θ0 ∏ − log (1 − θ0 ) ∏
[ x =1] [ x =1]
θ(+d1),d (1 − θ(+1),d )[ xd =0] θ(−d1),d (1 − θ(−1),d )[ xd =0] model assumptions
d d
(7.23)
= log θ0 − log(1 − θ0 ) + ∑[ xd = 1] log θ(+1),d − log θ(−1),d
d
+ ∑[ xd = 0] log(1 − θ(+1),d ) − log(1 − θ(−1),d ) take logs and rearrange
d
(7.24)
θ(+1),d 1 − θ(+1),d θ0
= ∑ xd log + ∑(1 − xd ) log + log simplify log terms
d
θ(−1),d d
1 − θ(−1),d 1 − θ0
(7.25)
" #
θ(+1),d 1 − θ(+1),d 1 − θ(+1),d θ0
= ∑ xd log − log + ∑ log + log group x-terms
d
θ(−1),d 1 − θ(−1),d d
1 − θ(−1),d 1 − θ0
(7.26)
= x·w+b (7.27)
The result of the algebra is that the naive Bayes model has precisely
the form of a linear model! Thus, like perceptron and many of the
other models you’ve previously studied, the decision boundary is
linear.
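One way to see this concretely is to read the weights and bias off Eq (7.26). The sketch below does exactly that, assuming all the estimated probabilities lie strictly between 0 and 1 (for instance because of smoothing):
import numpy as np

def naive_bayes_to_linear(theta0, theta_pos, theta_neg):
    # w_d is a difference of log-odds ratios; b collects all the x-independent terms.
    w = (np.log(theta_pos) - np.log(theta_neg)
         - np.log(1 - theta_pos) + np.log(1 - theta_neg))
    b = (np.log(1 - theta_pos) - np.log(1 - theta_neg)).sum() \
        + np.log(theta0) - np.log(1 - theta0)
    return w, b   # predict +1 exactly when x.w + b > 0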
TODO: MBR
you can take a derivative with respect to, say µk,i and obtain:
∂ log p(D) / ∂µk,i = − ∂/∂µk,i ∑n ∑d (1 / (2σ²yn,d )) (xn,d − µyn,d )²        [ignore irrelevant terms]   (7.31)
                   = − ∂/∂µk,i ∑n:yn =k (1 / (2σ²k,i )) (xn,i − µk,i )²        [ignore irrelevant terms]   (7.32)
                   = ∑n:yn =k (1 / σ²k,i ) (xn,i − µk,i )                      [take derivative]           (7.33)
Setting this equal to zero and solving yields:
µk,i = ∑n:yn =k xn,i / ∑n:yn =k 1                                                                          (7.34)
Namely, the sample mean of the ith feature of the data points that fall
in class k. A similar analysis for σ²k,i yields:
" #
∂ log p( D ) ∂ 1 1
2
= 2 − ∑ 2
log(σk,i ) + 2 ( xn,i − µk,i ) 2
ignore irrelevant terms
∂σk,i ∂σk,i y:yn =k 2 2σk,i
(7.35)
" #
1 1
=− ∑ 2
2σk,i
− 2 )2
2(σk,i
( xn,i − µk,i )2 take derivative
y:yn =k
(7.36)
1 h i
= 4
2σk,i
∑ ( xn,i − µk, i )2 − σk,i
2
simplify
y:yn =k
(7.37)
You can now set this equal to zero and solve, yielding:
σ²k,i = ∑n:yn =k (xn,i − µk,i )² / ∑n:yn =k 1
Which is just the sample variance of feature i for class k.
(What would the estimate be if you decided that, for a given class k, all features
had equal variance? What if you assumed feature i had equal variance for each
class? Under what circumstances might it be a good idea to make such
assumptions?)
7.6 Conditional Models
In the foregoing examples, the task was formulated as attempting to
model the joint distribution of (x, y) pairs. This may seem wasteful:
at prediction time, all you care about is p(y | x), so why not model it
directly?
Starting with the case of regression is actually somewhat simpler
than starting with classification in this case. Suppose you “believe”
that the relationship between the real value y and the vector x should
be linear. That is, you expect that y = w · x + b should hold for some
parameters (w, b). Of course, the data that you get does not exactly
obey this: that’s fine, you can think of deviations from y = w · x +
b as noise. To form a probabilistic model, you must assume some
distribution over noise; a convenient choice is zero-mean Gaussian
noise. This leads to the following generative story:
(a) Compute tn = w · xn + b
(b) Choose noise en ∼ Nor(0, σ2 )
(c) Return yn = tn + en
Reading off the log likelihood of a dataset from this generative story,
you obtain:
log p(D) = ∑n [ − ½ log(σ²) − (1 / (2σ²)) (w · xn + b − yn )² ]        [model assumptions]   (7.39)
         = − (1 / (2σ²)) ∑n (w · xn + b − yn )² + const                [remove constants]    (7.40)
Logistic function: σ(z) = 1 / (1 + exp[−z]) = exp z / (1 + exp z)      (7.41)
The logistic function has several nice properties that you can verify
for yourself: σ(−z) = 1 − σ(z) and ∂σ/∂z = exp(−z) σ²(z).
Using the logistic function, you can write down a generative story
for binary classification:
(a) Compute tn = σ (w · xn + b)
(b) Compute zn ∼ Ber(tn )
(c) Return yn = 2zn − 1 (to make it ±1)
Figure 7.4: sketch of logistic function
for the bias of the coin is 100%: it will always come up heads. This is
true even if you had only flipped it once! Of course, if you had flipped
it one million times and it had come up heads every time, then you
might find this to be a reasonable solution.
This is clearly undesirable behavior, especially since data is expen-
sive in a machine learning setting. One solution (there are others!) is
to seek parameters that balance a tradeoff between the likelihood of
the data and some prior belief you have about what values of those
parameters are likely. Taking the case of the logistic regression, you
might a priori believe that small values of w are more likely than
large values, and choose to represent this as a Gaussian prior on each
component of w.
The maximum a posteriori principle is a method for incorporat-
ing both data and prior beliefs to obtain a more balanced parameter
estimate. In abstract terms, consider a probabilistic model over data
D that is parameterized by parameters θ. If you think of the pa-
rameters as just another random variable, then you can write this
model as p( D | θ ), and maximum likelihood amounts to choosing θ
to maximize p( D | θ ). However, you might instead wish to maximize
the probability of the parameters, given the data. Namely, maximize
p(θ | D ). This term is known as the posterior distribution on θ, and
can be computed by Bayes’ rule:
p(θ | D) = p(θ) p(D | θ) / p(D),   where   p(D) = ∫ p(θ) p(D | θ) dθ   (7.46)
This reads: the posterior is equal to the prior times the likelihood di-
vided by the evidence.2 The evidence is a scary-looking term (it has 2
The evidence is sometimes called the
an integral!) but note that from the perspective of seeking parameters marginal likelihood.
θ that maximize the posterior, the evidence is just a constant (it does
not depend on θ) and therefore can be ignored.
Returning to the logistic regression example with Gaussian priors
on the weights, the log posterior looks like:
log p(θ | D) = − ∑n ℓ(log)(yn , w · xn + b) − (1 / (2σ²)) ∑d w²d + const   [model definition]   (7.47)
             = − ∑n ℓ(log)(yn , w · xn + b) − (1 / (2σ²)) ||w||² + const                        (7.48)
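A minimal sketch of MAP training for this model is gradient ascent on the log posterior of Eq (7.48); the step size, iteration count and the 1/log 2 scaling of logistic loss below are assumptions made for illustration:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_logistic_regression(X, y, sigma2=1.0, eta=0.01, epochs=500):
    # X is an N x D numpy array, y a vector of +1/-1 labels.
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        coef = y * sigmoid(-y * (X @ w + b)) / np.log(2)   # minus the derivative of the loss term
        w += eta * (X.T @ coef - w / sigma2)               # gradient of Eq (7.48) with respect to w
        b += eta * coef.sum()                              # no prior is placed on b here
    return w, b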
7.8 Exercises
the inputs as a real layer. That is, it’s two layers of trained weights.)
Prediction with a neural network is a straightforward generaliza-
tion of prediction with a perceptron. First you compute activations
of the nodes in the hidden unit based on the inputs and the input
weights. Then you compute activations of the output unit given the
hidden unit activations and the second layer of weights.
The only major difference between this computation and the per-
ceptron computation is that the hidden units compute a non-linear
function of their inputs. This is usually called the activation function
or link function. More formally, if wi,d is the weights on the edge
connecting input d to hidden unit i, then the activation of hidden unit
i is computed as:
hi = f ( wi · x ) (8.1)
Where the second line is shorthand assuming that tanh can take a
vector as input and produce a vector as output.
(Is it necessary to use a link function at all? What would happen if you just
used the identity function as a link?)
You can solve this problem using a two layer network with two
hidden units. The key idea is to make the first hidden unit compute
an “or” function: x1 ∨ x2 . The second hidden unit can compute an
“and” function: x1 ∧ x2 . Then the output can combine these into a
single prediction that mimics XOR. Once you have the first hidden
unit activate for “or” and the second for “and,” you need only set the
output weights to +1 and −2, respectively, together with a small negative
output bias (say −2). (Verify that these output weights will actually give you
XOR.)
To achieve the “or” behavior, you can start by setting the bias to
+0.5 and the weights for the two “real” features as both being 1. You
can check for yourself that this will do the “right thing” if the link
function were the sign function. Of course it’s not, it’s tanh. To get
tanh to mimic sign, you need to make the dot product either really
really large or really really small. You can accomplish this by set-
ting the bias to +500,000 and both of the two weights to 1,000,000.
Now, the activation of this unit will be just slightly above −1 for
x = ⟨−1, −1⟩ and just slightly below +1 for the other three examples.
(This shows how to create an “or” function. How can you create an “and”
function?)
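The sketch below checks one concrete set of weights realizing this construction (±1 inputs, saturated tanh hidden units, output weights +1 and −2 with an output bias of −2); the particular “and” unit shown is just one possible answer to the question above:
import numpy as np

def xor_net(x1, x2):
    h_or = np.tanh(1e6 * x1 + 1e6 * x2 + 5e5)      # saturated tanh mimicking sign(x1 + x2 + 0.5)
    h_and = np.tanh(1e6 * x1 + 1e6 * x2 - 1.5e6)   # saturated tanh mimicking sign(x1 + x2 - 1.5)
    return np.sign(1.0 * h_or - 2.0 * h_and - 2.0)

for x1 in (-1, +1):
    for x2 in (-1, +1):
        print(x1, x2, xor_net(x1, x2))             # +1 exactly when x1 != x2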
At this point you’ve seen that one-layer networks (aka percep-
trons) can represent any linear function and only linear functions.
You’ve also seen that two-layer networks can represent non-linear
functions like XOR. A natural question is: do you get additional
representational power by moving beyond two layers? The answer
is partially provided in the following Theorem, due originally to
George Cybenko for one particular type of link function, and ex-
tended later by Kurt Hornik to arbitrary link functions.
function.”
This is a remarkable theorem. Practically, it says that if you give
me a function F and some error tolerance parameter e, I can construct
a two layer network that approximates F to within e. In a sense, it says that going
from one layer to two layers completely changes the representational
capacity of your model.
When working with two-layer networks, the key question is: how
many hidden units should I have? If your data is D dimensional
and you have K hidden units, then the total number of parameters
is ( D + 2)K. (The first +1 is from the bias, the second is from the
second layer of weights.) Following on from the heuristic that you
should have one to two examples for each parameter you are trying
to estimate, this suggests a method for choosing the number of hid-
den units as roughly ⌊N/D⌋. In other words, if you have tons and tons
of examples, you can safely have lots of hidden units. If you only
have a few examples, you should probably restrict the number of
hidden units in your network.
The number of units is both a form of inductive bias and a form
of regularization. In both views, the number of hidden units controls
how complex your function will be. Lots of hidden units ⇒ very
complicated function. Figure ?? shows training and test error for
neural networks trained with different numbers of hidden units. As
the number increases, training performance continues to get better.
But at some point, test performance gets worse because the network
has overfit the data.
More specifically, the set up is exactly the same as before. You are
going to optimize the weights in the network to minimize some ob-
jective function. The only difference is that the predictor is no longer
linear (i.e., ŷ = w · x + b) but now non-linear (i.e., v · tanh(Wx̂)).
The only question is how to do gradient descent on this more compli-
cated objective.
For now, we will ignore the idea of regularization. This is for two
reasons. The first is that you already know how to deal with regular-
ization, so everything you’ve learned before applies. The second is
that historically, neural networks have not been regularized. Instead,
Putting this together, we get that the gradient with respect to wi is:
Intuitively you can make sense of this. If the overall error of the
predictor (e) is small, you want to make small steps. If vi is small
for hidden unit i, then this means that the output is not particularly
sensitive to the activation of the ith hidden unit. Thus, its gradient
should be small. If vi flips sign, the gradient at wi should also flip
signs. The name back-propagation comes from the fact that you
propagate gradients backward through the network, starting at the
end.
The complete instantiation of gradient descent for a two layer
network with K hidden units is sketched in Algorithm 8.2. Note that
this really is exactly a gradient descent algorithm; the only difference is
that the computation of the gradients of the input layer is moderately
complicated. (What would happen to this algorithm if you wanted to optimize
exponential loss instead of squared error? What if you wanted to add in weight
regularization?)
As a bit of practical advice, implementing the back-propagation
algorithm can be a bit tricky. Sign errors often abound. A useful trick
is first to keep W fixed and work on just training v. Then keep v
fixed and work on training W. Then put them together.
Based on what you know about linear models, you might be tempted
to initialize all the weights in a neural network to zero. You might
also have noticed that in Algorithm ??, this is not what’s done:
they’re initialized to small random values. The question is why?
The answer is because an initialization of W = 0 and v = 0 will
lead to “uninteresting” solutions. In other words, if you initialize the
model in this way, it will eventually get stuck in a bad local optimum.
To see this, first realize that on any example x, the activation hi of the
hidden units will all be zero since W = 0. This means that on the first
iteration, the gradient on the output weights (v) will be zero, so they
will stay put. Furthermore, the gradient for w1,d , the dth feature on
the first hidden unit, will be exactly the same as the gradient for w2,d , the same
feature on the second hidden unit. This means that the weight matrix, after
a gradient step, will change in exactly the same way for every hidden
unit. Thinking through this example for iterations 2 . . . , the values of
the hidden units will always be exactly the same, which means that
the weights feeding in to any of the hidden units will be exactly the
same. Eventually the model will converge, but it will converge to a
solution that does not take advantage of having access to the hidden
units.
This shows that neural networks are sensitive to their initialization.
In particular, the function that they optimize is non-convex, meaning
that it might have plentiful local optima. (One of which is the trivial
local optimum described in the preceding paragraph.) In a sense,
neural networks must have local optima. Suppose you have a two
layer network with two hidden units that’s been optimized. You have
weights w1 from inputs to the first hidden unit, weights w2 from in-
puts to the second hidden unit and weights (v1 , v2 ) from the hidden
units to the output. If I give you back another network with w1 and
w2 swapped, and v1 and v2 swapped, the network computes exactly
the same thing, but with a markedly different weight structure. This
phenomenon is known as symmetric modes (“mode” referring to an
optima) meaning that there are symmetries in the weight space. It
would be one thing if there were lots of modes and they were all
symmetric: then finding one of them would be as good as finding
any other. Unfortunately there are additional local optima that are
not global optima.
Random initialization of the weights of a network is a way to
address both of these problems. By initializing a network with small
random weights (say, uniform between −0.1 and 0.1), the network is
unlikely to fall into the trivial, symmetric local optimum. Moreover,
by training a collection of networks, each with a different random
initialization, you can often obtain better solutions than with just
one initialization. In other words, you can train ten networks with
different random seeds, and then pick the one that does best on held-
out data. Figure 8.3 shows prototypical test-set performance for ten
networks with different random initialization, plus an eleventh plot
for the trivial symmetric network initialized with zeros.
Figure 8.3: convergence of randomly initialized networks
One of the typical complaints about neural networks is that they
are finicky. In particular, they have a rather large number of knobs to
tune:
4. The initialization
Algorithm 26 ForwardPropagation(x)
1: for all input nodes u do
2: hu ← corresponding feature of x
3: end for
4: for all other nodes v whose parents’ activations have been computed do
5: av ← ∑u∈par(v) w(u,v) hu
6: hv ← tanh( av )
7: end for
8: return ay
Algorithm 27 BackPropagation(x, y)
1: run ForwardPropagation(x) to compute activations
2: ey ← y − ay // compute overall network error
3: for all nodes v in the network whose error ev is computed do
4: for all u ∈ par(v) do
5: gu,v ← −ev hu // compute gradient of this edge
6: eu ← eu + ev wu,v (1 − tanh2 ( au )) // compute the “error” of the parent node
7: end for
8: end for
9: return all gradients ge
on the output unit). The graph has D-many inputs (i.e., nodes with
no parent), whose activations hu are given by an input example. An
edge (u, v) is from a parent to a child (i.e., from an input to a hidden
unit, or from a hidden unit to the sink). Each edge has a weight wu,v .
We say that par(u) is the set of parents of u.
There are two relevant algorithms: forward-propagation and back-
propagation. Forward-propagation tells you how to compute the
activation of the sink y given the inputs. Back-propagation computes
derivatives of the edge weights for a given input.
The key aspect of the forward-propagation algorithm is to iter-
atively compute activations, going deeper and deeper in the DAG.
Once the activations of all the parents of a node u have been com-
puted, you can compute the activation of node u. This is spelled out
in Algorithm 8.4. This is also explained pictorially in Figure 8.6.
Figure 8.6: picture of forward prop
Back-propagation (see Algorithm 8.4) does the opposite: it com-
putes gradients top-down in the network. The key idea is to compute
an error for each node in the network. The error at the output unit is
the “true error.” For any input unit, the error is the amount of gradi-
ent that we see coming from our children (i.e., higher in the network).
These errors are computed backwards in the network (hence the
name back-propagation) along with the gradients themselves. This is
also explained pictorially in Figure 8.7.
Given the back-propagation algorithm, you can directly run gradi-
ent descent, using it as a subroutine for computing the gradients.
At this point, you’ve seen how to train two-layer networks and how
to train arbitrary networks. You’ve also seen a theorem that says
that two-layer networks are universal function approximators. This
begs the question: if two-layer networks are so great, why do we care
about deeper networks?
To understand the answer, we can borrow some ideas from CS
theory, namely the idea of circuit complexity. The goal is to show
that there are functions for which it might be a “good idea” to use a
deep network. In other words, there are functions that will require a
huge number of hidden units if you force the network to be shallow,
but can be done in a small number of units if you allow it to be deep.
The example that we’ll use is the parity function which, ironically
enough, is just a generalization of the XOR problem. The function is
defined over binary inputs as:
parity(x) = ∑d xd mod 2                                                (8.12)
          = { 1 if the number of 1s in x is odd
              0 if the number of 1s in x is even }                     (8.13)
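As a quick sanity check, the parity function itself is a one-liner:
def parity(x):
    return sum(x) % 2   # 1 if the number of 1s in x is odd, 0 if it is even

print(parity([1, 0, 1, 1]))   # 1
print(parity([1, 0, 1, 0]))   # 0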
It is easy to define a circuit of depth O(log2 D ) with O( D )-many
gates for computing the parity function. Each gate is an XOR, ar-
ranged in a complete binary tree, as shown in Figure 8.8. (If you
want to disallow XOR as a gate, you can fix this by allowing the
depth to be doubled and replacing each XOR with an AND, OR and
NOT combination, like you did at the beginning of this chapter.)
Figure 8.8: nnet:paritydeep: deep function for computing parity
This shows that if you are allowed to be deep, you can construct a
circuit that computes parity using a number of hidden units that
is linear in the dimensionality. So can you do the same with shallow
circuits? The answer is no. It’s a famous result of circuit complexity
that parity requires exponentially many gates to compute in constant
depth. The formal theorem is below:
Theorem 10 (Parity Function Complexity). Any circuit of depth K <
log2 D that computes the parity function of D input bits must contain O(e^D)
gates.
This is a very famous result because it shows that constant-depth
circuits are less powerful than deep circuits. Although a neural net-
work isn’t exactly the same as a circuit, it is generally believed that
the same result holds for neural networks. At the very least, this
gives a strong indication that depth might be an important considera-
tion in neural networks. (What is it about neural networks that makes it so
that the theorem about circuits does not apply directly?)
One way of thinking about the issue of breadth versus depth has
to do with the number of parameters that need to be estimated. By
the heuristic that you need roughly one or two examples for every
parameter, a deep model could potentially require exponentially
fewer examples to train than a shallow model!
This now flips the question: if deep is potentially so much better,
why doesn’t everyone use deep networks? There are at least two
answers. First, it makes the architecture selection problem more
significant. Namely, when you use a two-layer network, the only
hyperparameter to choose is how many hidden units should go in
the middle layer. When you choose a deep network, you need to
choose how many layers, and what is the width of all those layers.
This can be somewhat daunting.
A second issue has to do with training deep models with back-
propagation. In general, as back-propagation works its way down
through the model, the sizes of the gradients shrink. You can work
this out mathematically, but the intuition is simpler. If you are the
beginning of a very deep network, changing one single weight is
unlikely to have a significant effect on the output, since it has to
go through so many other units before getting there. This directly
implies that the derivatives are small. This, in turn, means that back-
propagation essentially never moves far from its initialization when
run on very deep networks. (While these small derivatives might make
training difficult, they might be good for other reasons: what reasons?)
Finding good ways to train deep networks is an active research
area. There are two general strategies. The first is to attempt to ini-
tialize the weights better, often by a layer-wise initialization strategy.
This can be often done using unlabeled data. After this initializa-
tion, back-propagation can be run to tweak the weights for whatever
classification problem you care about. A second approach is to use a
more complex optimization procedure, rather than gradient descent.
You will learn about some such procedures later in this book.
At this point, we’ve seen that: (a) neural networks can mimic linear
functions and (b) they can learn more complex functions. A rea-
sonable question is whether they can mimic a KNN classifier, and
whether they can do it efficiently (i.e., with not-too-many hidden
units).
A natural way to train a neural network to mimic a KNN classifier
is to replace the sigmoid link function with a radial basis function
(RBF). In a sigmoid network (i.e., a network with sigmoid links),
the hidden units were computed as hi = tanh(wi · x). In an RBF
network, the hidden units are computed as:
hi = exp[ −γi ||wi − x||² ]                                            (8.14)
In other words, the hidden units behave like little Gaussian “bumps”
centered around locations specified by the vectors wi . A one-dimensional
example is shown in Figure 8.9. The parameter γi specifies the width
of the Gaussian bump. If γi is large, then only data points that are
really close to wi have non-zero activations. To distinguish sigmoid
networks from RBF networks, the hidden units are typically drawn
with sigmoids or with Gaussian bumps, as in Figure 8.10.
Training RBF networks involves finding good values for the Gaus-
sian widths, γi , the centers of the Gaussian bumps, wi and the con-
nections between the Gaussian bumps and the output unit, v. This
can all be done using back-propagation. The gradient terms for v re-
main unchanged from before, but the derivatives for the other variables
differ (see Exercise ??).
Figure 8.9: nnet:rbfpicture: a one-D picture of RBF bumps
One of the big questions with RBF networks is: where should
the Gaussian bumps be centered? One can, of course, apply back-
propagation to attempt to find the centers. Another option is to spec-
ify them ahead of time. For instance, one potential approach is to
have one RBF unit per data point, centered on that data point. If you
carefully choose the γs and vs, you can obtain something that looks
nearly identical to distance-weighted KNN by doing so. This has the
added advantage that you can go further, and use back-propagation
to learn good Gaussian widths (γ) and “voting” factors (v) for the
nearest neighbor algorithm.
Figure 8.10: nnet:unitsymbols: picture of nnet with sigmoid/rbf units
(Consider an RBF network with one hidden unit per training point, centered
at that point. What bad thing might happen if you use back-propagation to
estimate the γs and v on this data if you’re not careful? How could you be
careful?)
8.7 Exercises
Exercise 8.1. TODO. . .
9 | KERNEL METHODS
Learning Objectives:
• Explain how kernels generalize both feature combinations and basis functions.
• Contrast dot products with kernel products.
• Implement kernelized perceptron.
• Derive a kernelized version of regularized least squares regression.
• Implement a kernelized version of the perceptron.
• Derive the dual formulation of the support vector machine.
Linear models are great because they are easy to understand
and easy to optimize. They suffer because they can only learn very
simple decision boundaries. Neural networks can learn more com-
plex decision boundaries, but lose the nice convexity properties of
many linear models.
One way of getting a linear model to behave non-linearly is to
transform the input. For instance, by adding feature pairs as addi-
tional inputs. Learning a linear model on such a representation is
convex, but is computationally prohibitive in all but very low dimen-
sional spaces. You might ask: instead of explicitly expanding the fea-
ture space, is it possible to stay with our original data representation
and do all the feature blow up implicitly? Surprisingly, the answer is
often “yes” and the family of techniques that makes this possible are
known as kernel approaches.
Dependencies:
In Section 4.4, you learned one method for increasing the expressive
power of linear models: explode the feature space. For instance,
a “quadratic” feature explosion might map a feature vector x =
h x1 , x2 , x3 , . . . , x D i to an expanded version denoted φ( x):
(Note that there are repetitions here, but hopefully most learning
algorithms can deal well with redundant features; in particular, the
2x1 terms are due to collapsing some repetitions.)
= 1 + 2x · z + (x · z)²                                                (9.4)
= (1 + x · z)²                                                         (9.5)
Thus, you can compute φ( x) · φ(z) in exactly the same amount of
time as you can compute x · z (plus the time it takes to perform an
addition and a multiply, about 0.02 nanoseconds on a circa 2011
processor).
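You can check this identity numerically. The sketch below uses one particular ordering and scaling of the quadratic feature map (an assumption made for illustration; any map with the same inner products works equally well):
import numpy as np

def phi(x):
    quadratic = np.outer(x, x).ravel()                       # all pairwise products x_i * x_j
    return np.concatenate(([1.0], np.sqrt(2) * x, quadratic))

x, z = np.array([1.0, 2.0, -1.0]), np.array([0.5, -1.0, 3.0])
print(phi(x) @ phi(z))       # the explicit dot product in the expanded space ...
print((1 + x @ z) ** 2)      # ... equals the kernel value, computed in O(D) time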
The rest of the practical challenge is to rewrite your algorithms so
that they only depend on dot products between examples and not on
any explicit weight vectors.
11: return w, b
Proof of Theorem 11. By induction. Base case: the span of any non-
empty set contains the zero vector, which is the initial weight vec-
tor. Inductive case: suppose that the theorem is true before the kth
update, and suppose that the kth update happens on example n.
By the inductive hypothesis, you can write w = ∑i αi φ( xi ) before
11: return α, b
Now that you know that you can always write w = ∑n αn φ(xn ) for
some αn s, you can additionally compute the activations (line 4) as:
w · φ(x) + b = ( ∑n αn φ(xn ) ) · φ(x) + b          [definition of w]          (9.6)
             = ∑n αn [ φ(xn ) · φ(x) ] + b          [dot products are linear]  (9.7)
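Putting these pieces together, here is a minimal sketch of the kernelized perceptron in Python; precomputing the Gram matrix and capping the number of passes are implementation choices, not part of the derivation:
import numpy as np

def kernel_perceptron_train(X, y, K, max_passes=100):
    N = len(y)
    alpha, b = np.zeros(N), 0.0
    G = np.array([[K(X[n], X[m]) for m in range(N)] for n in range(N)])   # Gram matrix
    for _ in range(max_passes):
        mistakes = 0
        for n in range(N):
            a = alpha @ G[:, n] + b          # activation, as in Eq (9.7)
            if y[n] * a <= 0:                # mistake: the usual perceptron-style update
                alpha[n] += y[n]
                b += y[n]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

def kernel_perceptron_predict(alpha, b, X_train, K, x):
    return np.sign(sum(alpha[n] * K(X_train[n], x) for n in range(len(alpha))) + b)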
cluster means µ(1) , . . . , µ(K) . The algorithm then alternates between the
following two steps until convergence, with x replaced by φ( x) since
that is the eventual goal:
2
1. For each example n, set cluster label zn = arg mink φ( xn ) − µ(k) .
1
2. For each cluster k, update µ(k) = Nk ∑n:zn =k φ( xn ), where Nk is the
number of n with zn = k.
The question is whether you can perform these steps without ex-
plicitly computing φ( xn ). The representer theorem is more straight-
forward here than in the perceptron. The mean of a set of data is,
almost by definition, in the span of that data (choose the ai s all to be
equal to 1/N). Thus, so long as you initialize the means in the span
of the data, you are guaranteed always to have the means in the span
of the data. Given this, you know that you can write each mean as an
expansion of the data; say that µ(k) = ∑n αn(k) φ(xn ) for some parame-
ters αn(k) (there are N×K-many such parameters).
Given this expansion, in order to execute step (1), you need to compute norms. This can be done as follows:

z_n = arg min_k ||φ(x_n) − µ^(k)||²    definition of z_n    (9.8)
    = arg min_k ||φ(x_n) − ∑_m α_m^(k) φ(x_m)||²    definition of µ^(k)    (9.9)
    = arg min_k ||φ(x_n)||² + ||∑_m α_m^(k) φ(x_m)||² − 2 φ(x_n) · ∑_m α_m^(k) φ(x_m)    expand quadratic term    (9.10)
    = arg min_k ∑_m ∑_{m'} α_m^(k) α_{m'}^(k) φ(x_m) · φ(x_{m'}) − 2 ∑_m α_m^(k) φ(x_m) · φ(x_n) + const    linearity and constant    (9.11)
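As a sketch of how this computation looks in practice (the variable names are mine; it assumes the full Gram matrix K has been precomputed), the assignment step can be written so that it never touches φ explicitly:

import numpy as np

def kernel_kmeans_assign(K, alpha):
    """Assignment step of kernelized K-means, using only the Gram matrix.
    K:     N x N matrix with K[m, n] = kernel(x_m, x_n)
    alpha: C x N matrix; row k holds the expansion coefficients of mean mu^(k)."""
    N = K.shape[0]
    C = alpha.shape[0]
    scores = np.zeros((N, C))
    for k in range(C):
        a = alpha[k]
        mean_norm = a @ K @ a      # ||mu^(k)||^2 = sum_m sum_m' a_m a_m' K(x_m, x_m')
        cross = K @ a              # phi(x_n) . mu^(k) for every n at once
        scores[:, k] = mean_norm - 2.0 * cross   # ||phi(x_n)||^2 dropped: same for all k
    return scores.argmin(axis=1)

The ||φ(x_n)||² term is dropped because it is identical for every k and so does not affect the arg min.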
functions. For instance, using it you can easily prove the following, which would be difficult from the definition of kernels as inner products after feature mappings: if K1 and K2 are both positive semi-definite kernel functions, then so is their sum K = K1 + K2, since for any function f:

∬ f(x) K(x, z) f(z) dx dz = ∬ f(x) [K1(x, z) + K2(x, z)] f(z) dx dz    definition of K    (9.15)
= ∬ f(x) K1(x, z) f(z) dx dz + ∬ f(x) K2(x, z) f(z) dx dz    distributive rule    (9.16)
≥ 0 + 0    K1 and K2 are psd    (9.17)
f(z) = ∑_n α_n K(x_n, z) + b    (9.19)
     = ∑_n α_n exp[−γ ||x_n − z||²] + b    (9.20)
min_{w,b,ξ}  ½ ||w||² + C ∑_n ξ_n    (9.23)
subj. to  y_n (w · x_n + b) ≥ 1 − ξ_n   (∀n)
          ξ_n ≥ 0   (∀n)

L(w, b, ξ, α, β) = ½ ||w||² + C ∑_n ξ_n − ∑_n β_n ξ_n    (9.24)
                   − ∑_n α_n [y_n (w · x_n + b) − 1 + ξ_n]    (9.25)
The intuition is exactly the same as before. If you are able to find a solution that satisfies the constraints (e.g., the slack term ξ_n is properly non-negative), then the β_n s cannot do anything to "hurt" the solution. On the other hand, if that term is negative, then the corresponding β_n can go to +∞, breaking the solution.

You can solve this problem by taking gradients. This is a bit tedious, but an important step to realize how everything fits together. Since your goal is to remove the dependence on w, the first step is to take a gradient with respect to w, set it equal to zero, and solve for w in terms of the other variables.
∇_w L = w − ∑_n α_n y_n x_n = 0   ⟺   w = ∑_n α_n y_n x_n    (9.27)

Substituting this expression for w back into the Lagrangian gives:

L(b, ξ, α, β) = ½ ||∑_m α_m y_m x_m||² + C ∑_n ξ_n − ∑_n β_n ξ_n    (9.28)
               − ∑_n α_n [y_n ((∑_m α_m y_m x_m) · x_n + b) − 1 + ξ_n]    (9.29)
At this point, it's convenient to rewrite these terms; be sure you understand where the following comes from:

L(b, ξ, α, β) = ½ ∑_n ∑_m α_n α_m y_n y_m x_n · x_m + ∑_n (C − β_n) ξ_n    (9.30)
               − ∑_n ∑_m α_n α_m y_n y_m x_n · x_m − ∑_n α_n (y_n b − 1 + ξ_n)    (9.31)
             = −½ ∑_n ∑_m α_n α_m y_n y_m x_n · x_m + ∑_n (C − β_n) ξ_n    (9.32)
               − b ∑_n α_n y_n − ∑_n α_n (ξ_n − 1)    (9.33)
Things are starting to look good: you’ve successfully removed the de-
pendence on w, and everything is now written in terms of dot prod-
ucts between input vectors! This might still be a difficult problem to
solve, so you need to continue and attempt to remove the remaining
variables b and ξ.
The derivative with respect to b is:

∂L/∂b = − ∑_n α_n y_n = 0    (9.34)
This doesn’t allow you to substitute b with something (as you did
with w), but it does mean that the fourth term (b ∑n αn yn ) goes to
zero at the optimum.
The last of the original variables is ξ_n; the derivatives in this case look like:

∂L/∂ξ_n = C − β_n − α_n = 0   ⟺   C − β_n = α_n    (9.35)

Again, this doesn't allow you to substitute, but it does mean that you can rewrite the second term, which was ∑_n (C − β_n) ξ_n, as ∑_n α_n ξ_n. This then cancels with (most of) the final term. However, you need to be careful to remember something. When we optimize, both α_n and β_n are constrained to be non-negative. What this means is that, since we are dropping β from the optimization, we need to ensure that α_n ≤ C; otherwise the corresponding β_n would need to be negative, which is not allowed.
L(α) = ∑_n α_n − ½ ∑_n ∑_m α_n α_m y_n y_m K(x_n, x_m)    (9.36)
If you are comfortable with matrix notation, this has a very compact form. Let 1 denote the N-dimensional vector of all 1s, let y denote the vector of labels, and let G be the N×N matrix with G_{n,m} = y_n y_m K(x_n, x_m); then this has the following form:

L(α) = α^⊤ 1 − ½ α^⊤ G α    (9.37)

Recast as a minimization, the dual optimization problem is:

min_α  −L(α) = ½ ∑_n ∑_m α_n α_m y_n y_m K(x_n, x_m) − ∑_n α_n    (9.38)
subj. to  0 ≤ α_n ≤ C   (∀n)
The ∑_n α_n term in L(α) encourages the α_n s to get as large as possible. The constraint ensures that they cannot exceed C, which means that the general tendency is for the αs to grow as close to C as possible.
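To see the matrix form concretely, the following sketch (numpy; the kernel argument and the function name are my own) evaluates L(α) for a candidate α. An actual solver would additionally maximize this subject to 0 ≤ α_n ≤ C and, when the bias b is kept, ∑_n α_n y_n = 0, for instance by coordinate ascent.

import numpy as np

def svm_dual_objective(alpha, X, y, kernel):
    """Evaluate L(alpha) = alpha^T 1 - (1/2) alpha^T G alpha,
    where G[n, m] = y_n y_m K(x_n, x_m)."""
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])  # Gram matrix
    G = np.outer(y, y) * K                                    # label-signed Gram matrix
    return alpha.sum() - 0.5 * alpha @ G @ alpha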
To further understand the dual optimization problem, it is useful
to think of the kernel as being a measure of similarity between two
data points. This analogy is most clear in the case of RBF kernels,
but even in the case of linear kernels, if your examples all have unit
norm, then their dot product is still a measure of similarity. Since you
can write the prediction function as f ( x̂) = sign(∑n αn yn K ( xn , x̂)), it
is natural to think of αn as the “importance” of training example n,
where αn = 0 means that it is not used at all at test time.
Consider two data points that have the same label; namely, yn =
ym . This means that yn ym = +1 and the objective function has a term
that looks like αn αm K ( xn , xm ). Since the goal is to make this term
small, then one of two things has to happen: either K has to be small,
or αn αm has to be small. If K is already small, then this doesn’t affect
the setting of the corresponding αs. But if K is large, then this strongly
encourages at least one of αn or αm to go to zero. So if you have two
data points that are very similar and have the same label, at least one
of the corresponding αs will be small. This makes intuitive sense: if
you have two data points that are basically the same (both in the x
and y sense) then you only need to “keep” one of them around.
Suppose that you have two data points with different labels:
yn ym = −1. Again, if K ( xn , xm ) is small, nothing happens. But if
it is large, then the corresponding αs are encouraged to be as large as
possible. In other words, if you have two similar examples with dif-
ferent labels, you are strongly encouraged to push the corresponding αs as large as possible, up to the limit C.
An alternative way of understanding the SVM dual problem is geometric. Remember that the whole point of introducing the variable α_n was to ensure that the nth training example is correctly classified, modulo slack. More formally, the goal of α_n is to ensure that y_n (w · x_n + b) − 1 + ξ_n ≥ 0. Suppose that this constraint is not satisfied. There is an important result in optimization theory, called the Karush-Kuhn-Tucker conditions (or KKT conditions, for short), which states that at the optimum, the product of the Lagrange multiplier for a constraint and the value of that constraint will equal zero. In this case, this says that at the optimum, you have:

α_n [y_n (w · x_n + b) − 1 + ξ_n] = 0    (9.39)

In order for this to be true, it means that (at least) one of the following must be true:

α_n = 0   or   y_n (w · x_n + b) − 1 + ξ_n = 0    (9.40)
9.7 Exercises
10 | LEARNING THEORY

For nothing ought to be posited without a reason given, unless it is self-evident or known by experience or proved by the authority of Sacred Scripture. – William of Occam, c. 1320

Learning Objectives:
• Explain why inductive bias is necessary.
• Define the PAC model and explain why both the "P" and "A" are necessary.
• Explain the relationship between complexity measures and regularizers.
• Identify the role of complexity in generalization.
• Formalize the relationship between margins and complexity.

By now, you are an expert at building learning algorithms. You probably understand how they work, intuitively. And you understand why they should generalize. However, there are several basic questions you might want to know the answer to. Is learning always possible? How many training examples will I need to do a good job learning? Is my test performance going to be much worse than my training performance? The key idea that underlies all these answers is that simple functions generalize well.
The amazing thing is that you can actually prove strong results
that address the above questions. In this chapter, you will learn
some of the most important results in learning theory that attempt
to answer these questions. The goal of this chapter is not theory for
theory’s sake, but rather as a way to better understand why learning
models work, and how to use this theory to build better algorithms.
As a concrete example, we will see how 2-norm regularization prov-
ably leads to better generalization performance, thus justifying our
common practice!
One nice thing about theory is that it forces you to be precise about
what you are trying to do. You’ve already seen a formal definition
of binary classification in Chapter 5. But let’s take a step back and
re-analyze what it means to learn to do binary classification.
From an algorithmic perspective, a natural question is whether there is an "ultimate" learning algorithm, A^awesome, that solves the Binary Classification problem above. In other words, have you been wasting your time learning about KNN and Perceptron and decision trees, when A^awesome is out there?
What would such an ultimate learning algorithm do? You would
like it to take in a data set D and produce a function f . No matter
what D looks like, this function f should get perfect classification on
all future examples drawn from the same distribution that produced
D.
A little bit of introspection should demonstrate that this is impos-
sible. For instance, there might be label noise in our distribution. As
a very simple example, let X = {−1, +1} (i.e., a one-dimensional, binary input space). Define the data distribution as:
There are two notions of efficiency that matter in PAC learning. The first is the usual notion of computational complexity. You would prefer an algorithm that runs quickly to one that takes forever. The second is the notion of sample complexity: the number of examples required for your algorithm to achieve its goals. Note that the goal of both of these measures of complexity is to bound how much of a scarce resource your algorithm uses. In the computational case, the resource is CPU cycles. In the sample case, the resource is labeled examples.
The first thing to notice about this algorithm is that after processing
an example, it is guaranteed to classify that example correctly. This
observation requires that there is no noise in the data.
The second thing to notice is that it’s very computationally ef-
ficient. Given a data set of N examples in D dimensions, it takes
O( ND ) time to process the data. This is linear in the size of the data
set.
However, in order to be an efficient (ε, δ)-PAC learning algorithm, you need to be able to get a bound on the sample complexity of this algorithm. Sure, you know that its run time is linear in the number of examples N. But how many examples N do you need to see in order to guarantee that it achieves an error rate of at most ε (in all but δ-many cases)? Perhaps N has to be gigantic (like 2^(2^(D/ε))) to (probably) guarantee a small error.
The goal is to prove that the number of samples N required to
(probably) achieve a small error is not-too-big. The general proof
technique for this has essentially the same flavor as almost every PAC
learning proof around. First, you define a “bad thing.” In this case,
a “bad thing” is that there is some term (say ¬ x8 ) that should have
been thrown out, but wasn't. Then you say: well, bad things happen. Then you notice that if this bad thing happened, you must not have seen any positive example that would have caused that term to be thrown out.
Algorithm 30 BinaryConjunctionTrain(D)
1: f ← x1 ∧ ¬ x1 ∧ x2 ∧ ¬ x2 ∧ · · · ∧ x D ∧ ¬ x D // initialize function
2: for all positive examples ( x,+1) in D do
3: for d = 1 . . . D do
4: if xd = 0 then
5: f ← f without term “xd ”
6: else
7: f ← f without term “¬ xd ”
8: end if
9: end for
10: end for
11: return f
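A direct translation of Algorithm 30 into Python might look like the sketch below; representing terms as ('pos', d) / ('neg', d) pairs is an implementation choice of mine, not something the book prescribes.

def binary_conjunction_train(data):
    """data: list of (x, y) pairs; x is a sequence of 0/1 values, y in {+1, -1}.
    The hypothesis is a set of terms: ('pos', d) means x_d must be 1,
    ('neg', d) means x_d must be 0."""
    D = len(data[0][0])
    # start with every literal: x_1 ^ not x_1 ^ ... ^ x_D ^ not x_D
    f = {('pos', d) for d in range(D)} | {('neg', d) for d in range(D)}
    for x, y in data:
        if y != +1:
            continue                   # only positive examples trigger removals
        for d in range(D):
            if x[d] == 0:
                f.discard(('pos', d))  # x_d cannot be required to be 1
            else:
                f.discard(('neg', d))  # x_d cannot be required to be 0
    return f

def conjunction_predict(f, x):
    satisfied = all((x[d] == 1) if kind == 'pos' else (x[d] == 0) for kind, d in f)
    return +1 if satisfied else -1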
Proof of Theorem 13. Let c be the concept you are trying to learn and let D be the distribution that generates the data.

A learned function f can make a mistake if it contains any term t that is not in c. There are initially 2D-many terms in f, and any (or all!) of them might not be in c. We want to ensure that the probability that f makes an error is at most ε. It is sufficient to ensure that each of these 2D terms contributes at most ε/2D to the error.

For a term t (e.g., ¬x_5), we say that t "negates" an example x if t(x) = 0. Call a term t "bad" if (a) it does not appear in c and (b) it has probability at least ε/2D of negating an example (with respect to the unknown distribution D over data points).

First, we show that if we have no bad terms left in f, then f has an error rate at most ε.

We know that f contains at most 2D terms, since it begins with 2D terms and only ever throws terms out.

The algorithm begins with 2D terms (one for each variable and one for each negated variable). Note that f will only make one type of error: it can call positive examples negative, but it can never call a negative example positive. Let c be the true concept (true boolean formula) and call a term "bad" if it does not appear in c. A specific bad term (e.g., ¬x_5) will cause f to err only on positive examples that contain a corresponding bad value (e.g., x_5 = 1). TODO... finish this
TODO COMMENTS
This theorem applies directly to the “Throw Out Bad Terms” algo-
rithm, since (a) the hypothesis class is finite and (b) the learned func-
tion always achieves zero error on the training data. To apply Oc-
cam’s Bound, you need only compute the size of the hypothesis class
H of boolean conjunctions. You can compute this by noticing that there are a total of 2D possible terms in any formula in H. Moreover, each term may or may not be in a formula. So there are 2^{2D} = 4^D possible formulae; thus, |H| = 4^D. Applying Occam's Bound, we see that the sample complexity of this algorithm is N ≤ . . . .
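For reference, one textbook-standard statement of a bound of this flavor for a finite hypothesis class (the exact form and constants of the book's Occam's Bound may differ) is: with probability at least 1 − δ, any hypothesis consistent with N training examples has error at most ε provided

N ≥ (1/ε) (ln |H| + ln(1/δ)).

Plugging in |H| = 4^D for boolean conjunctions gives N ≥ (1/ε)(D ln 4 + ln(1/δ)), i.e., a sample complexity that grows only linearly in the number of features D.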
Of course, Occam’s Bound is general enough to capture other
learning algorithms as well. In particular, it can capture decision
trees! In the no-noise setting, a decision tree will always fit the train-
ing data perfectly. The only remaining difficulty is to compute the
size of the hypothesis class of a decision tree learner.
For simplicity's sake, suppose that our decision tree algorithm always learns complete trees: i.e., every branch from root to leaf is length D. So the number of split points in the tree (i.e., places where a feature is queried) is 2^{D−1}. (See Figure 10.1.) Each split point needs to be assigned a feature: there are D-many choices here. This gives D 2^{D−1} trees. The last thing is that there are 2^D leaves of the tree, each of which can take two possible values, depending on whether this leaf is classified as +1 or −1: this is 2 × 2^D = 2^{D+1} possibilities. Putting this all together gives a total number of trees |H| = D 2^{D−1} 2^{D+1} = D 2^{2D} = D 4^D. Applying Occam's Bound, we see that TODO examples is enough to learn a decision tree!

Figure 10.1: picture of a full decision tree.
Despite the fact that there’s no way to get better than 20% error on
this distribution, it would be nice to say that you can still learn some-
thing from it. For instance, the predictor that always guesses y = x
seems like the “right” thing to do. Based on this observation, maybe
we can rephrase the goal of learning as to find a function that does
as well as the distribution allows. In other words, on this data, you
would hope to get 20% error. On some other distribution, you would
hope to get X% error, where X% is the best you could do.
This notion of “best you could do” is sufficiently important that
it has a name: the Bayes error rate. This is the error rate of the best
possible classifier, the so-called Bayes optimal classifier. If you knew
the underlying distribution D , you could actually write down the
exact Bayes optimal classifier explicitly. (This is why learning is unin-
teresting in the case that you know D .) It simply has the form:
f^{Bayes}(x) = +1 if D(x, +1) > D(x, −1), and −1 otherwise    (10.7)

The Bayes optimal error rate is the error rate that this (hypothetical) classifier achieves:

ε^{Bayes} = E_{(x,y)∼D} [y ≠ f^{Bayes}(x)]    (10.8)
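As a small illustration (the distribution below is my own example, chosen to be consistent with the 20% figure discussed above, not necessarily the one used earlier in the chapter), the Bayes classifier and its error can be read directly off a joint distribution table:

def bayes_classifier_and_error(joint):
    """joint: dict mapping (x, y) pairs, with y in {+1, -1}, to probabilities P(x, y)."""
    xs = {x for (x, _) in joint}
    f_bayes, error = {}, 0.0
    for x in xs:
        p_pos = joint.get((x, +1), 0.0)
        p_neg = joint.get((x, -1), 0.0)
        f_bayes[x] = +1 if p_pos > p_neg else -1
        error += min(p_pos, p_neg)   # probability mass the best classifier still gets wrong
    return f_bayes, error

# A hypothetical distribution on X = {-1, +1} where y = x 80% of the time:
joint = {(+1, +1): 0.4, (+1, -1): 0.1, (-1, -1): 0.4, (-1, +1): 0.1}
f, err = bayes_classifier_and_error(joint)   # err == 0.2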
10.10 Exercises
All of the learning algorithms you have seen so far are deterministic.
If you train a decision tree multiple times on the same data set, you
will always get the same tree back. In order to get an effect out of
voting multiple classifiers, they need to differ. There are two primary
ways to get variability. You can either change the learning algorithm
or change the data set.
Building an ensemble by training different classifiers is the most straightforward approach. As in single-model learning, you are given a data set (say, for classification). Instead of learning a single classifier (e.g., a decision tree) on this data set, you learn multiple different classifiers. For instance, you might train a decision tree, a perceptron, a KNN, and multiple neural networks with different architectures. Call these classifiers f_1, ..., f_M. At test time, you can make a prediction by voting: on a test example x̂, you compute ŷ_1 = f_1(x̂), ..., ŷ_M = f_M(x̂), and predict the label that receives the most votes, as in the sketch below.
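A minimal sketch of such a voting ensemble (assuming each trained classifier exposes a predict method, which is my interface choice, not the book's):

from collections import Counter

def ensemble_predict(classifiers, x):
    """Predict by (unweighted) majority vote over a list of trained classifiers."""
    votes = Counter(f.predict(x) for f in classifiers)
    return votes.most_common(1)[0][0]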
Algorithm 31 AdaBoost(W, D, K)
1: d^(0) ← ⟨1/N, 1/N, ..., 1/N⟩ // Initialize uniform importance to each example
2: for k = 1 ... K do
3:   f^(k) ← W(D, d^(k-1)) // Train kth classifier on weighted data
4:   ŷ_n ← f^(k)(x_n), ∀n // Make predictions on training data
5:   ê^(k) ← ∑_n d_n^(k-1) [y_n ≠ ŷ_n] // Compute weighted training error
6:   α^(k) ← ½ log((1 − ê^(k)) / ê^(k)) // Compute "adaptive" parameter
7:   d_n^(k) ← (1/Z) d_n^(k-1) exp[−α^(k) y_n ŷ_n], ∀n // Re-weight examples and normalize
8: end for
9: return f(x̂) = sgn[∑_k α^(k) f^(k)(x̂)] // Return (weighted) voted classifier
= sgn [w · x + b] (11.4)
You can notice that this is nothing but a two-layer neural network, with K-many hidden units! Of course it's not a classically trained neural network (once you learn w^(k) you never go back and update it), but the structure is identical.
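The following is a compact Python sketch of AdaBoost (Algorithm 31). The train_weak interface, which accepts per-example weights and returns a callable hypothesis, is an assumption of this sketch, and the clipping of the weighted error is a practical guard rather than part of the algorithm as stated.

import numpy as np

def adaboost(train_weak, X, y, K):
    N = X.shape[0]
    d = np.full(N, 1.0 / N)                    # uniform importance to each example
    hypotheses, alphas = [], []
    for _ in range(K):
        h = train_weak(X, y, d)                # train weak learner on weighted data
        yhat = h(X)                            # predictions in {+1, -1}
        err = float(np.sum(d * (y != yhat)))   # weighted training error
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) / division by 0
        a = 0.5 * np.log((1 - err) / err)      # "adaptive" parameter
        d = d * np.exp(-a * y * yhat)          # re-weight examples
        d = d / d.sum()                        # and normalize
        hypotheses.append(h)
        alphas.append(a)
    def predict(Xtest):
        score = sum(a * h(Xtest) for a, h in zip(alphas, hypotheses))
        return np.sign(score)                  # (weighted) voted classifier
    return predict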
11.4 Exercises
g = ∑_n ∇_w ℓ(y_n, w · x_n) + λw    (12.1)
where `(y, ŷ) is some loss function. Then you update the weights by
w ← w − ηg. In this algorithm, in order to make a single update, you
have to look at every training example.
When there are billions of training examples, it is a bit silly to look
at every one before doing anything. Perhaps just on the basis of the
first few examples, you can already start learning something!
Stochastic optimization involves thinking of your training data
as a big distribution over examples. A draw from this distribution
corresponds to picking some example (uniformly at random) from
your data set. Viewed this way, the optimization problem becomes a
stochastic optimization problem, because you are trying to optimize
some function (say, a regularized linear classifier) over a probability
distribution. You can derive this interpretation directly as follows:
Algorithm 33 StochasticGradientDescent(F, D, S, K, η_1, ...)
1: z^(0) ← ⟨0, 0, ..., 0⟩ // initialize variable we are optimizing
2: for k = 1 ... K do
3:   D^(k) ← S-many random data points from D
4:   g^(k) ← ∇_z F(D^(k)) |_{z = z^(k-1)} // compute gradient on sample
5:   z^(k) ← z^(k-1) − η^(k) g^(k) // take a step down the gradient
6: end for
7: return z^(K)
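A minimal Python/numpy sketch of Algorithm 33 follows; the grad_F interface, the step-size schedule passed as a function, and the assumption that the parameter vector has the same dimensionality as the data (as in a linear model) are my own illustrative choices.

import numpy as np

def stochastic_gradient_descent(grad_F, D, S, K, eta):
    """grad_F(z, batch): gradient of the objective on a sampled batch;
    D: (num_points x dim) data array; S: batch size; K: number of steps;
    eta: step-size schedule, e.g. eta = lambda k: 0.1 / np.sqrt(k)."""
    rng = np.random.default_rng(0)
    z = np.zeros(D.shape[1])                  # initialize variable being optimized
    for k in range(1, K + 1):
        batch = D[rng.choice(len(D), size=S, replace=False)]  # sample S data points
        g = grad_F(z, batch)                  # gradient on the sample
        z = z - eta(k) * g                    # take a step down the gradient
    return z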
optimization problem:

g_d ←  g_d   if w_d > 0 and g_d ≤ (1/η^(k)) w_d
       g_d   if w_d < 0 and g_d ≥ (1/η^(k)) w_d    (12.8)
       0     otherwise
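Coordinate-wise, the rule in Eq (12.8) is easy to implement; a possible numpy sketch (the function name is mine) is below. The effect is to keep a gradient coordinate only if the resulting step does not push that weight past zero.

import numpy as np

def truncate_gradient(g, w, eta):
    """Zero out gradient coordinates whose update would cross zero (Eq 12.8)."""
    g_out = np.zeros_like(g)
    keep_pos = (w > 0) & (g <= w / eta)   # positive weight, step stays non-negative
    keep_neg = (w < 0) & (g >= w / eta)   # negative weight, step stays non-positive
    g_out[keep_pos] = g[keep_pos]
    g_out[keep_neg] = g[keep_neg]
    return g_out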
1. Initialize x̂ = ⟨0, 0, ..., 0⟩

2. For each d = 1 ... D: add x_d into position h(d) of x̂, i.e., x̂_{h(d)} ← x̂_{h(d)} + x_d

3. Return x̂

This hashed representation can equivalently be written as:

φ(x)_p = ∑_d [h(d) = p] x_d = ∑_{d ∈ h^{-1}(p)} x_d    (12.9)
= ∑_d ∑_{e ∈ h^{-1}(h(d))} x_d z_e    (12.13)
= x · z + ∑_d ∑_{e ≠ d, e ∈ h^{-1}(h(d))} x_d z_e    (12.14)
This hash kernel has the form of a linear kernel plus a small number
of quadratic terms. The particular quadratic terms are exactly those
given by collisions of the hash function.
There are two things to notice about this. The first is that collisions
might not actually be bad things! In a sense, they’re giving you a
little extra representational power. In particular, if the hash function
happens to select out feature pairs that benefit from being paired,
then you now have a better representation. The second is that even if
this doesn’t happen, the quadratic term in the kernel has only a small
effect on the overall prediction. In particular, if you assume that your
hash function is pairwise independent (a common assumption of
hash functions), then the expected value of this quadratic term is zero, and its variance decreases at a rate of O(P^{-2}). In other words, if you choose P ≈ 100, then the variance is on the order of 0.0001.
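A sketch of the hashing trick itself follows; the particular hash function is a stand-in chosen for illustration and is not claimed to be pairwise independent (any good hash family would do in theory).

import numpy as np

def hash_features(x, P, h):
    """phi(x)_p = sum of x_d over all d with h(d) = p  (Eq 12.9)."""
    phi = np.zeros(P)
    for d, value in enumerate(x):
        phi[h(d)] += value
    return phi

P = 100
h = lambda d: (2654435761 * (d + 1)) % P          # illustrative stand-in hash
phi = hash_features(np.random.randn(10000), P, h) # 10000-dim input -> 100-dim output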
12.5 Exercises
13 | UNSUPERVISED LEARNING

Learning Objectives:
• Implement agglomerative clustering.
• Argue whether spectral clustering is a clustering algorithm or a dimensionality reduction algorithm.

Unsupervised learning is learning without a teacher. One basic thing that you might want to do with data is to visualize it. Sadly, it is difficult to visualize things in more than two (or three) dimensions, and most data is in hundreds of dimensions (or more). Dimensionality reduction is the problem of taking high dimensional data and embedding it in a lower dimension space. Another thing you might want to do is automatically derive a partitioning of the data into clusters. You've already learned a basic approach for doing this: the k-means algorithm (Chapter 2). Here you will analyze this algorithm to see why it works. You will also learn more advanced clustering approaches.
Algorithm 34 K-Means(D, K)
1: for k = 1 to K do
2:   µ_k ← some random location // randomly initialize mean of cluster k
3: end for
4: repeat
5:   for n = 1 to N do
6:     z_n ← argmin_k ||µ_k − x_n|| // assign example n to closest center
7:   end for
8:   for k = 1 to K do
9:     µ_k ← mean({x_n : z_n = k}) // re-estimate mean of cluster k
10:  end for
11: until converged
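For concreteness, here is a compact numpy sketch of Algorithm 34; initializing the means at randomly chosen data points is one of several reasonable choices, not something the algorithm prescribes.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)  # init means at data points
    z = np.full(N, -1)
    for _ in range(max_iter):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
        z_new = dists.argmin(axis=1)        # assign each example to closest center
        if np.array_equal(z_new, z):
            break                           # assignments stopped changing: converged
        z = z_new
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)   # re-estimate mean of cluster k
    return z, mu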
The quantity that K-means is optimizing is the sum of squared distances from any data point to its assigned center. This is a natural generalization of the definition of a mean: the mean of a set of points is the single point that minimizes the sum of squared distances from the mean to every point in the data. Formally, the K-Means objective is:

L(z, µ; D) = ∑_n ||x_n − µ_{z_n}||² = ∑_k ∑_{n : z_n = k} ||x_n − µ_k||²    (13.1)
First, the fact that K-means only converges to a local optimum of this objective means that you should usually run it 10 times with different initializations and pick the one with minimal resulting L. Second, one
can show that there are input datasets and initializations on which
it might take an exponential amount of time to converge. Fortu-
nately, these cases almost never happen in practice, and in fact it has
recently been shown that (roughly) if you limit the floating point pre-
cision of your machine, K-means will converge in polynomial time
(though still only to a local optimum), using techniques of smoothed
analysis.
The biggest practical issue in K-means is initialization. If the clus-
ter means are initialized poorly, you often get convergence to uninter-
esting solutions. A useful heuristic is the furthest-first heuristic. This
gives a way to perform a semi-random initialization that attempts to
pick initial means as far from each other as possible. The heuristic is
sketched below:
1. Pick a random example m and set µ_1 = x_m.

2. For k = 2 ... K:
   (a) Find the example m that is as far as possible from all previously selected means; namely: m = arg max_m min_{k' < k} ||x_m − µ_{k'}||², and set µ_k = x_m.
of the objective. This doesn’t tell you that you will reach the global
optimum, but it does tell you that you will get reasonably close. In
particular, if L̂ is the value obtained by running K-means++, then this
will not be “too far” from L(opt) , the true global minimum.
max_u  ∑_n (x_n · u)²   subj. to ||u||² = 1    (13.5)
You can solve this expression (λu = X> Xu) by computing the first
eigenvector and eigenvalue of the matrix X> X.
This gives you the solution to a projection into a one-dimensional
space. To get a second dimension, you want to find a new vector v on
which the data has maximal variance. However, to avoid redundancy,
you want v to be orthogonal to u; namely u · v = 0. This gives:
Algorithm 36 PCA(D, K)
1: µ ← mean(X) // compute data mean for centering
2: D ← (X − µ1^⊤)^⊤ (X − µ1^⊤) // compute covariance, 1 is a vector of ones
3: {λ_k, u_k} ← top K eigenvalues/eigenvectors of D
4: return (X − µ1^⊤) U // project data using U, whose columns are the u_k
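A numpy sketch of Algorithm 36 follows; using eigh on the covariance matrix is one reasonable implementation choice of mine (an SVD of the centered data would work equally well).

import numpy as np

def pca(X, K):
    mu = X.mean(axis=0)                      # data mean, for centering
    Xc = X - mu                              # centered data
    C = Xc.T @ Xc                            # (unnormalized) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh because C is symmetric
    order = np.argsort(eigvals)[::-1][:K]    # indices of the top-K eigenvalues
    U = eigvecs[:, order]                    # D x K matrix of principal directions
    return Xc @ U, U, eigvals[order]         # projected data, directions, variances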
||X − Zu^⊤||² = ||X − Xuu^⊤||²    definition of Z    (13.14)
= ||X||² + ||Xuu^⊤||² − 2 tr(X^⊤ X u u^⊤)    quadratic rule    (13.15)
= ||X||² + ||Xuu^⊤||² − 2 u^⊤ X^⊤ X u    quadratic rule    (13.16)
= ||X||² + ||Xu||² − 2 u^⊤ X^⊤ X u    u is a unit vector    (13.17)
= C − ||Xu||²    join constants, rewrite last term    (13.18)
what is a manifold?
graph construction
isomap
lle
mvu
mds?
what is a spectrum
spectral clustering
13.6 Exercises
14 | EXPECTATION MAXIMIZATION

Learning Objectives:
• Implement EM for clustering with mixtures of Gaussians, and contrast it with k-means.
• Evaluate the differences between EM and gradient descent for hidden variable models.

Suppose you were building a naive Bayes model for a text categorization problem. After you were done, your boss told you that it became prohibitively expensive to obtain labeled data. You now have a probabilistic model that assumes access to labels, but you don't have any labels! Can you still do something?

Amazingly, you can. You can treat the labels as hidden variables, and attempt to learn them at the same time as you learn the parameters of your model. A very broad family of algorithms for solving problems just like this is the expectation maximization family. In this chapter, you will derive expectation maximization (EM) algorithms for clustering and dimensionality reduction, and then see why EM works.
If you had access to labels, this would be all well and good, and
you could obtain closed form solutions for the maximum likelihood
estimates of all parameters by taking a log and then taking gradients
of the log likelihood:
Algorithm 37 GMM(X, K)
1: for k = 1 to K do
2:   µ_k ← some random location // randomly initialize mean of cluster k
3:   σ_k² ← 1 // initialize variance of cluster k
4:   θ_k ← 1/K // each cluster equally likely a priori
5: end for
6: repeat
7:   for n = 1 to N do
8:     for k = 1 to K do
9:       z_{n,k} ← θ_k (2πσ_k²)^{−D/2} exp[−(1/(2σ_k²)) ||x_n − µ_k||²] // compute (unnormalized) fractional assignments
10:    end for
11:    z_n ← (1 / ∑_k z_{n,k}) z_n // normalize fractional assignments
12:  end for
13:  for k = 1 to K do
14:    θ_k ← (1/N) ∑_n z_{n,k} // re-estimate prior probability of cluster k
15:    µ_k ← ∑_n z_{n,k} x_n / ∑_n z_{n,k} // re-estimate mean of cluster k
16:    σ_k² ← ∑_n z_{n,k} ||x_n − µ_k||² / ∑_n z_{n,k} // re-estimate variance of cluster k
17:  end for
18: until converged
19: return z // return cluster assignments
All that has happened here is that the hard assignments “[yn = k]”
have been replaced with soft assignments “zn,k ”. As a bit of fore-
shadowing of what is to come, what we’ve done is essentially replace
known labels with expected labels, hence the name “expectation maxi-
mization.”
Putting this together yields Algorithm 14.1. This is the GMM ("Gaussian Mixture Models") algorithm, because the probabilistic model being learned describes a dataset as being drawn from a mixture distribution, where each component of this distribution is a Gaussian.
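A compact numpy sketch of Algorithm 37 for spherical Gaussians follows. The random initialization, the per-dimension normalization in the variance update, and the absence of log-space tricks (which a robust implementation would need to avoid underflow) are choices of this sketch, not requirements of the algorithm.

import numpy as np

def gmm_em(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)  # initialize means
    sigma2 = np.ones(K)                     # initialize (spherical) variances
    theta = np.full(K, 1.0 / K)             # uniform cluster priors
    for _ in range(iters):
        # E-step: (unnormalized) fractional assignments z[n, k]
        z = np.zeros((N, K))
        for k in range(K):
            sq = ((X - mu[k]) ** 2).sum(axis=1)
            z[:, k] = theta[k] * (2 * np.pi * sigma2[k]) ** (-D / 2.0) \
                      * np.exp(-sq / (2 * sigma2[k]))
        z /= z.sum(axis=1, keepdims=True)   # normalize per example
        # M-step: re-estimate priors, means and variances
        Nk = z.sum(axis=0)
        theta = Nk / N
        for k in range(K):
            mu[k] = (z[:, k, None] * X).sum(axis=0) / Nk[k]
            sigma2[k] = (z[:, k] * ((X - mu[k]) ** 2).sum(axis=1)).sum() / (D * Nk[k])
    return z, mu, sigma2, theta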
At this point, the natural thing to do is to take logs and then start taking gradients. However, once you start taking logs, you run into a problem: the log gets "stuck" outside the sum and cannot move in to decompose the rest of the likelihood term!
The next step is to apply the somewhat strange, but strangely
useful, trick of multiplying by 1. In particular, let q(·) be an arbitrary
probability distribution. We will multiply the p(. . . ) term above by
q(y_n)/q(y_n), a valid step so long as q is never zero. This leads to:

L(X | θ) = ∑_n log ∑_{y_n} q(y_n) p(x_n, y_n | θ) / q(y_n)    (14.16)

Because log is concave, Jensen's inequality lets you push the log inside the (weighted) sum, giving a lower bound:

L(X | θ) ≥ ∑_n ∑_{y_n} q(y_n) log [p(x_n, y_n | θ) / q(y_n)]    (14.17)
         = ∑_n ∑_{y_n} [q(y_n) log p(x_n, y_n | θ) − q(y_n) log q(y_n)]    (14.18)
         ≜ L̃(X | θ)    (14.19)
Note that this inequality holds for any choice of function q, so long as it is non-negative and sums to one. In particular, it needn't even be the same function q for each n. We will need to take advantage of both of these properties.

We have succeeded in our first goal: constructing a lower bound on L. When you go to optimize this lower bound for θ, the only part that matters is the first term. The second term, q log q, drops out as a function of θ. This means that the maximization you need to be able to compute, for fixed q_n s, is:
The second question is: what should qn (·) actually be? Any rea-
sonable q will lead to a lower bound, so in order to choose one q over
another, we need another criterion. Recall that we are hoping to max-
imize L by instead maximizing a lower bound. In order to ensure
that an increase in the lower bound implies an increase in L, we need
to ensure that L(X | θ) = L̃(X | θ). In words: L̃ should be a lower
bound on L that makes contact at the current point, θ. This is shown
in Figure ??, including a case where the lower bound does not make
contact, and thereby does not guarantee an increase in L with an
increase in L̃.
derivation
advantages over pca
14.5 Exercises
key assumption
graphs and manifolds
label prop
density assumption
loss function
non-convex
motivation
qbc
qbu
15.6 Exercises
16.1 Exercises
Dependencies: None.
17 | ONLINE LEARNING

Learning Objectives:
• Explain the experts model, and why it is hard even to compete with the single best expert.
• Define what it means for an online learning algorithm to have no regret.
• Implement the follow-the-leader algorithm.
• Categorize online learning algorithms in terms of how they measure changes in parameters, and how they measure error.

All of the learning algorithms that you know about at this point are based on the idea of training a model on some data, and evaluating it on other data. This is the batch learning model. However, you may find yourself in a situation where students are constantly rating courses, and also constantly asking for recommendations. Online learning focuses on learning over a stream of data, on which you have to make predictions continually.

You have actually already seen an example of an online learning algorithm: the perceptron. However, our use of the perceptron and our analysis of its performance have both been in a batch setting. In this chapter, you will see a formalization of online learning (which differs from the batch learning formalization) and several algorithms for online learning with different properties.
regret
follow the leader
agnostic learning
algorithm versus problem
pa algorithm
online analysis
winnow
relationship to egd
17.5 Exercises
18.1 Exercises
Exercise 18.1. TODO. . .
19 | BAYESIAN L EARNING
Learning Objectives:
• TODO. . .
19.1 Exercises
C ODE AND DATASETS