
6.036 Lecture Notes


Contents

1 Introduction 4
1 Problem class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Density estimation . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . 6
1.3 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Sequence learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Other settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4 Model type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.1 No model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.2 Prediction rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Model class and parameter fitting . . . . . . . . . . . . . . . . . . . . . . . . . 9
6 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Linear classifiers 11
1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Learning algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Linear classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Learning linear classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5 Evaluating a learning algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 The Perceptron 15
1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Theory of the perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 Linear separability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Convergence theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


4 Feature representation 21
1 Polynomial basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2 Hand-constructing features for real domains . . . . . . . . . . . . . . . . . . 24
2.1 Discrete features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Numeric values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5 Margin Maximization 27
1 Machine learning as optimization . . . . . . . . . . . . . . . . . . . . . . . . . 27
2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Maximizing the margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6 Gradient Descent 33
1 One dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2 Multiple dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 Application to SVM objective . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

7 Regression 38
1 Analytical solution: ordinary least squares . . . . . . . . . . . . . . . . . . . 39
2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Optimization via gradient descent . . . . . . . . . . . . . . . . . . . . . . . . 41

8 Neural Networks 43
1 Basic element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.1 Single layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.2 Many layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 Choices of activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Error back-propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6 Loss functions and activation functions . . . . . . . . . . . . . . . . . . . . . 52
6.1 Two-class classification and log likelihood . . . . . . . . . . . . . . . 52
6.2 Multi-class classification and log likelihood . . . . . . . . . . . . . . . 53
7 Optimizing neural network parameters . . . . . . . . . . . . . . . . . . . . . 54
7.1 Batches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.2 Adaptive step-size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.1 Running averages . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.2 Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.2.3 Adadelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.2.4 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8.1 Methods related to ridge regression . . . . . . . . . . . . . . . . . . . 58
8.2 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8.3 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

9 Convolutional Neural Networks 60


1 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2 Max Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3 Typical architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63


10 Sequential models 66
1 State machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2 Markov decision processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.1 Finite-horizon solutions . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.1.1 Evaluating a given policy . . . . . . . . . . . . . . . . . . . . 69
2.1.2 Finding an optimal policy . . . . . . . . . . . . . . . . . . . 69
2.2 Infinite-horizon solutions . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.2.1 Evaluating a policy . . . . . . . . . . . . . . . . . . . . . . . 71
2.2.2 Finding an optimal policy . . . . . . . . . . . . . . . . . . . 71
2.2.3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

11 Reinforcement learning 73
1 Bandit problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2 Sequential problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.1 Model-based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.2 Policy search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.3 Value function learning . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.3.1 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.3.2 Function approximation . . . . . . . . . . . . . . . . . . . . 78
2.3.3 Fitted Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . 79

12 Recurrent Neural Networks 80


1 RNN model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2 Sequence-to-sequence RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3 Back-propagation through time . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4 Training a language model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5 Vanishing gradients and gating mechanisms . . . . . . . . . . . . . . . . . . 85
5.1 Simple gated recurrent networks . . . . . . . . . . . . . . . . . . . . . 86
5.2 Long short-term memory . . . . . . . . . . . . . . . . . . . . . . . . . 86

13 Recommender systems 88
1 Content-based recommendations . . . . . . . . . . . . . . . . . . . . . . . . . 88
2 Collaborative filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.1 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.1.1 Alternating least squares . . . . . . . . . . . . . . . . . . . . 93
2.1.2 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . 94

14 Non-parametric methods 95
1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
2.1.1 Building a tree . . . . . . . . . . . . . . . . . . . . . . . . . . 96
2.1.2 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.1 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4 Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100


CHAPTER 1

Introduction

The main focus of machine learning is making decisions or predictions based on data. There are a number of other fields with significant overlap in technique, but difference in focus: in economics and psychology, the goal is to discover underlying causal processes, and in statistics it is to find a model that fits a data set well. In those fields, the end product is a model. In machine learning, we often fit models, but as a means to the end of making good predictions or decisions. (This story is paraphrased from a post on 9/4/12 at andrewgelman.com.)
As machine-learning (ML) methods have improved in their capability and scope, ML has become the best way, measured in terms of speed, human engineering time, and robustness, to build many applications. Great examples are face detection, speech recognition, and many kinds of language-processing tasks. Almost any application that involves understanding data or signals that come from the real world can be best addressed using machine learning.
One crucial aspect of machine-learning approaches to solving problems is that human engineering plays an important (and often undervalued) role. A human still has to frame the problem: acquire and organize data, design a space of possible solutions, select a learning algorithm and its parameters, apply the algorithm to the data, validate the resulting solution to decide whether it's good enough to use, and so on. These steps are of great importance.
The conceptual basis of learning from data is the problem of induction: Why do we think that previously seen data will help us predict the future? (Bertrand Russell is my hero. –lpk) This is a serious philosophical problem of long standing. We will operationalize it by making assumptions, such as that all training data are IID (independent and identically distributed) and that queries will be drawn from the same distribution as the training data, or that the answer comes from a set of possible answers known in advance.
In general, we need to solve these two problems:

• estimation: When we have data that are noisy reflections of some underlying quantity of interest, we have to aggregate the data and make estimates or predictions about the quantity. How do we deal with the fact that, for example, the same treatment may end up with different results on different trials? How can we predict how well an estimate may compare to future results?

• generalization: How can we predict results of a situation or experiment that we have never encountered before in our data set?


We can describe problems and their solutions using six characteristics, three of which
characterize the problem and three of which characterize the solution:

1. Problem class: What is the nature of the training data and what kinds of queries will
be made at testing time?

2. Assumptions: What do we know about the source of the data or the form of the
solution?

3. Evaluation criteria: What is the goal of the prediction or estimation system? How
will the answers to individual queries be evaluated? How will the overall perfor-
mance of the system be measured?

4. Model type: Will an intermediate model be made? What aspects of the data will be
modeled? How will the model be used to make predictions?

5. Model class: What particular parametric class of models will be used? What criterion
will we use to pick a particular model from the model class?

6. Algorithm: What computational process will be used to fit the model to the data
and/or to make predictions?

Without making some assumptions about the nature of the process generating the data, we cannot perform generalization. In the following sections, we elaborate on these ideas. (Don't feel you have to memorize all these kinds of learning, etc. We just want you to have a very high-level view of (part of) the breadth of the field.)

1 Problem class

There are many different problem classes in machine learning. They vary according to what kind of data is provided and what kind of conclusions are to be drawn from it. Five standard problem classes are described below, to establish some notation and terminology. In this course, we will focus on classification and regression (two examples of supervised learning), and will touch on reinforcement learning and sequence learning.

1.1 Supervised learning


The idea of supervised learning is that the learning system is given inputs and told which
specific outputs should be associated with them. We divide up supervised learning based
on whether the outputs are drawn from a small finite set (classification) or a large finite or
continuous set (regression).

1.1.1 Classification
Training data Dn is in the form of a set of pairs {(x(1), y(1)), . . . , (x(n), y(n))} where x(i) represents an object to be classified, most typically a d-dimensional vector of real and/or discrete values, and y(i) is an element of a discrete set of values. The y values are sometimes called target values. (Many textbooks use xi and ti instead of x(i) and y(i). We find that notation somewhat difficult to manage when x(i) is itself a vector and we need to talk about its elements. The notation we are using is standard in some other parts of the machine-learning literature.)
A classification problem is binary or two-class if y(i) is drawn from a set of two possible values; otherwise, it is called multi-class.
The goal in a classification problem is ultimately, given a new input value x(n+1), to predict the value of y(n+1).
Classification problems are a kind of supervised learning, because the desired output (or class) y(i) is specified for each of the training examples x(i).

1.1.2 Regression
Regression is like classification, except that y(i) ∈ Rk .

1.2 Unsupervised learning


Unsupervised learning doesn’t involve learning a function from inputs to outputs based on
a set of input-output pairs. Instead, one is given a data set and generally expected to find
some patterns or structure inherent in it.

1.2.1 Density estimation


Given samples x(1), . . . , x(n) ∈ RD drawn IID from some distribution Pr(X), the goal is to predict the probability Pr(x(n+1)) of an element drawn from the same distribution. (IID stands for independent and identically distributed, which means that the elements in the set are related in the sense that they all come from the same underlying probability distribution, but not in any other ways.) Density estimation sometimes plays a role as a "subroutine" in the overall learning method for supervised learning, as well.

1.2.2 Clustering

Given samples x(1), . . . , x(n) ∈ RD, the goal is to find a partitioning (or "clustering") of the samples that groups together samples that are similar. There are many different objectives, depending on the definition of the similarity between samples and exactly what criterion is to be used (e.g., minimize the average distance between elements inside a cluster and maximize the average distance between elements across clusters). Other methods perform a "soft" clustering, in which samples may be assigned 0.9 membership in one cluster and 0.1 in another. Clustering is sometimes used as a step in density estimation, and sometimes to find useful structure in data.

1.2.3 Dimensionality reduction


Given samples x(1) , . . . , x(n) ∈ RD , the problem is to re-represent them as points in a d-
dimensional space, where d < D. The goal is typically to retain information in the data set
that will, e.g., allow elements of one class to be discriminated from another.
Dimensionality reduction is a standard technique which is particularly useful for vi-
sualizing or understanding high-dimensional data. If the goal is ultimately to perform re-
gression or classification on the data after the dimensionality is reduced, it is usually best to
articulate an objective for the overall prediction problem rather than to first do dimension-
ality reduction without knowing which dimensions will be important for the prediction
task.

1.3 Reinforcement learning


In reinforcement learning, the goal is to learn a mapping from input values x to output
values y, but without a direct supervision signal to specify which output values y are
best for a particular input. There is no training set specified a priori. Instead, the learning
problem is framed as an agent interacting with an environment, in the following setting:
• The agent observes the current state, x(0) .
• It selects an action, y(0) .
• It receives a reward, r(0) , which depends on x(0) and possibly y(0) .
• The environment transitions probabilistically to a new state, x(1) , with a distribution
that depends only on x(0) and y(0) .


• The agent observes the current state, x(1) .

• ...

The goal is to find a policy π, mapping x to y, (that is, states to actions) such that some
long-term sum or average of rewards r is maximized.
This setting is very different from either supervised learning or unsupervised learning,
because the agent’s action choices affect both its reward and its ability to observe the envi-
ronment. It requires careful consideration of the long-term effects of actions, as well as all
of the other issues that pertain to supervised learning.

1.4 Sequence learning


In sequence learning, the goal is to learn a mapping from input sequences x0 , . . . , xn to output
sequences y1 , . . . , ym . The mapping is typically represented as a state machine, with one
function f used to compute the next hidden internal state given the input, and another
function g used to compute the output given the current hidden state.
It is supervised in the sense that we are told what output sequence to generate for which
input sequence, but the internal functions have to be learned by some method other than
direct supervision, because we don’t know what the hidden state sequence is.

1.5 Other settings


There are many other problem settings. Here are a few.
In semi-supervised learning, we have a supervised-learning training set, but there may
be an additional set of x(i) values with no known y(i) . These values can still be used to
improve learning performance if they are drawn from Pr(X) that is the marginal of Pr(X, Y)
that governs the rest of the data set.
In active learning, it is assumed to be expensive to acquire a label y(i) (imagine asking a
human to read an x-ray image), so the learning algorithm can sequentially ask for particular
inputs x(i) to be labeled, and must carefully select queries in order to learn as effectively as
possible while minimizing the cost of labeling.
In transfer learning (also called meta-learning), there are multiple tasks, with data drawn
from different, but related, distributions. The goal is for experience with previous tasks to
apply to learning a current task in a way that requires decreased experience with the new
task.

2 Assumptions
The kinds of assumptions that we can make about the data source or the solution include:

• The data are independent and identically distributed.

• The data are generated by a Markov chain.

• The process generating the data might be adversarial.

• The “true” model that is generating the data can be perfectly described by one of
some particular set of hypotheses.

The effect of an assumption is often to reduce the “size” or “expressiveness” of the space of
possible hypotheses and therefore reduce the amount of data required to reliably identify
an appropriate hypothesis.


3 Evaluation criteria
Once we have specified a problem class, we need to say what makes an output or the an-
swer to a query good, given the training data. We specify evaluation criteria at two levels:
how an individual prediction is scored, and how the overall behavior of the prediction or
estimation system is scored.
The quality of predictions from a learned model is often expressed in terms of a loss
function. A loss function L(g, a) tells you how much you will be penalized for making a
guess g when the answer is actually a. There are many possible loss functions. Here are
some frequently used examples:

• 0-1 Loss applies to predictions drawn from finite domains. (If the actual values are drawn from a continuous distribution, the probability they would ever be equal to some predicted g is 0, except for some weird cases.)

    L(g, a) = 0 if g = a, and 1 otherwise

• Squared loss

    L(g, a) = (g − a)²

• Linear loss

    L(g, a) = |g − a|

• Asymmetric loss Consider a situation in which you are trying to predict whether someone is having a heart attack. It might be much worse to predict "no" when the answer is really "yes", than the other way around. For example:

    L(g, a) = 1 if g = 1 and a = 0;  10 if g = 0 and a = 1;  0 otherwise

  (These losses are sketched in code just after this list.)
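Here is a minimal Python sketch of these four losses (our own illustration; the function names are ours, not part of any course code):

    def loss_01(g, a):
        # 0-1 loss: penalize any wrong guess by 1
        return 0 if g == a else 1

    def loss_squared(g, a):
        # squared loss: large errors are penalized much more than small ones
        return (g - a) ** 2

    def loss_linear(g, a):
        # linear (absolute) loss
        return abs(g - a)

    def loss_asymmetric(g, a):
        # asymmetric loss for the heart-attack example:
        # missing a real "yes" (a = 1) is ten times worse than a false alarm
        if g == 1 and a == 0:
            return 1
        if g == 0 and a == 1:
            return 10
        return 0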

Any given prediction rule will usually be evaluated based on multiple predictions and
the loss of each one. At this level, we might be interested in:

• Minimizing expected loss over all the predictions (also known as risk)

• Minimizing maximum loss: the loss of the worst prediction

• Minimizing or bounding regret: how much worse this predictor performs than the
best one drawn from some class

• Characterizing asymptotic behavior: how well the predictor will perform in the limit
of infinite training data

• Finding algorithms that are probably approximately correct: they probably generate
a hypothesis that is right most of the time.

There is a theory of rational agency that argues that you should always select the action that minimizes the expected loss. This strategy will, for example, make you the most money in the long run, in a gambling setting. (Of course, there are other models for action selection, and it's clear that people do not always, or maybe even often, select actions that follow this rule.) Expected loss is also sometimes called risk in the machine-learning literature, but that term means other things in economics or other parts of decision theory, so be careful...it's risky to use it. We will, most of the time, concentrate on this criterion.


4 Model type
Recall that the goal of a machine-learning system is typically to estimate or generalize,
based on data provided. Below, we examine the role of model-making in machine learning.

4.1 No model
In some simple cases, in response to queries, we can generate predictions directly from
the training data, without the construction of any intermediate model. For example, in
regression or classification, we might generate an answer to a new query by averaging
answers to recent queries, as in the nearest neighbor method.

4.2 Prediction rule


This two-step process is more typical:

1. “Fit” a model to the training data

2. Use the model directly to make predictions

In the prediction rule setting of regression or classification, the model will be some hypothesis or prediction rule y = h(x; θ) for some functional form h. (We write f(a; b) to describe a function that is usually applied to a single argument a, but is a member of a parametric family of functions, with the particular function determined by parameter value b. So, for example, we might write h(x; p) = x^p to describe a function of a single argument that is parameterized by p.) The idea is that θ is a vector of one or more parameter values that will be determined by fitting the model to the training data and then be held fixed. Given a new x(n+1), we would then make the prediction h(x(n+1); θ).
The fitting process is often articulated as an optimization problem: find a value of θ that minimizes some criterion involving θ and the data. An optimal strategy, if we knew the actual underlying distribution on our data, Pr(X, Y), would be to predict the value of y that minimizes the expected loss, which is also known as the test error. If we don't have that actual underlying distribution, or even an estimate of it, we can take the approach of minimizing the training error: that is, finding the prediction rule h that minimizes the average loss on our training data set. So, we would seek θ that minimizes

    En(θ) = (1/n) Σ_{i=1}^{n} L(h(x(i); θ), y(i)) ,

where the loss function L(g, a) measures how bad it would be to make a guess of g when the actual value is a.
We will find that minimizing training error alone is often not a good choice: it is possible
to emphasize fitting the current data too strongly and end up with a hypothesis that does
not generalize well when presented with new x values.
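To make this concrete, here is a minimal sketch of computing En(θ) in Python (our own illustration; the one-parameter hypothesis family h(x; θ) = θx and the use of squared loss are hypothetical choices, just to have something to evaluate):

    def training_error(theta, data, h, loss):
        # average loss of hypothesis h(.; theta) over the training data
        return sum(loss(h(x, theta), y) for x, y in data) / len(data)

    h = lambda x, theta: theta * x          # hypothetical one-parameter hypothesis
    squared = lambda g, a: (g - a) ** 2     # squared loss from section 3
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
    print(training_error(2.0, data, h, squared))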

5 Model class and parameter fitting


A model class M is a set of possible models, typically parameterized by a vector of param-
eters Θ. What assumptions will we make about the form of the model? When solving a
regression problem using a prediction-rule approach, we might try to find a linear func-
tion h(x; θ, θ0 ) = θT x + θ0 that fits our data well. In this example, the parameter vector
Θ = (θ, θ0 ).
For problem types such as discrimination and classification, there are huge numbers of
model classes that have been considered...we’ll spend much of this course exploring these
model classes, especially neural network models. We will almost completely restrict our


attention to model classes with a fixed, finite number of parameters. Models that relax this
assumption are called “non-parametric” models.
How do we select a model class? In some cases, the machine-learning practitioner will
have a good idea of what an appropriate model class is, and will specify it directly. In other
cases, we may consider several model classes. In such situations, we are solving a model
selection problem: model-selection is to pick a model class M from a (usually finite) set of
possible model classes; model fitting is to pick a particular model in that class, specified by
parameters θ.

6 Algorithm
Once we have described a class of models and a way of scoring a model given data, we
have an algorithmic problem: what sequence of computational instructions should we run
in order to find a good model from our class? For example, determining the parameter
vector θ which minimizes En (θ) might be done using a familiar least-squares minimization
algorithm, when the model h is a function being fit to some data x.
Sometimes we can use software that was designed, generically, to perform optimiza-
tion. In many other cases, we use algorithms that are specialized for machine-learning
problems, or for particular hypotheses classes.
Some algorithms are not easily seen as trying to optimize a particular criterion. In fact,
the first algorithm we study for finding linear classifiers, the perceptron algorithm, has this
character.


CHAPTER 2

Linear classifiers

1 Classification
A binary classifier is a mapping from Rd → {−1, +1}. (Actually, general classifiers can have a range which is any discrete set, but we'll work with this specific case for a while.) We'll often use the letter h (for hypothesis) to stand for a classifier, so the classification process looks like:

    x → h → y .

Real life rarely gives us vectors of real numbers; the x we really want to classify is usually something like a song, image, or person. In that case, we'll have to define a function ϕ(x) whose range is Rd, where ϕ represents features of x, like a person's height or the amount of bass in a song, and then let h : ϕ(x) → {−1, +1}. In much of the following, we'll omit explicit mention of ϕ and assume that the x(i) are in Rd, but you should always have in mind that some additional process was almost surely required to go from the actual input examples to their feature representation.
In supervised learning we are given a training data set of the form

    Dn = {(x(1), y(1)), . . . , (x(n), y(n))} .

We will assume that each x(i) is a d × 1 column vector. The intended meaning of this data is that, when given an input x(i), the learned hypothesis should generate output y(i).
What makes a classifier useful? That it works well on new data; that is, that it makes good predictions on examples it hasn't seen. But we don't know exactly what data this classifier might be tested on when we use it in the real world. So, we have to assume a connection between the training data and testing data; typically, they are drawn independently from the same probability distribution. (My favorite analogy is to problem sets. We evaluate a student's ability to generalize by putting questions on the exam that were not on the homework (training set).)
Given a training set Dn and a classifier h, we can define the training error of h to be

    En(h) = (1/n) Σ_{i=1}^{n} { 1 if h(x(i)) ≠ y(i), 0 otherwise } .

For now, we will try to find a classifier with small training error (later, with some added criteria) and hope it generalizes well to new data, and has a small test error

    E(h) = (1/n') Σ_{i=n+1}^{n+n'} { 1 if h(x(i)) ≠ y(i), 0 otherwise }


on n' new examples that were not used in the process of finding the classifier.
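In code, these error rates are just averages of 0-1 losses over a data set; here is a minimal numpy sketch (our own illustration, with a deliberately silly constant classifier as the hypothesis):

    import numpy as np

    def error_rate(h, X, Y):
        # fraction of examples on which classifier h disagrees with the label
        # X: n x d array of examples, Y: length-n array of labels in {-1, +1}
        predictions = np.array([h(x) for x in X])
        return np.mean(predictions != Y)

    h_constant = lambda x: 1                      # hypothetical classifier: always predict +1
    X = np.array([[1.0, 2.0], [-3.0, 0.5], [2.0, -1.0]])
    Y = np.array([1, -1, 1])
    print(error_rate(h_constant, X, Y))           # one of three labels is -1, so 0.333...

Applied to held-out examples instead of the training set, the same function computes a test error.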

2 Learning algorithm
A hypothesis class H is a set (finite or infinite) of possible classifiers, each of which represents
a mapping from Rd → {−1, +1}.
A learning algorithm is a procedure that takes a data set Dn as input and returns an
element h of H; it looks like

Dn −→ learning alg (H) −→ h

We will find that the choice of H can have a big impact on the test error of the h that
results from this process. One way to get h that generalizes well is to restrict the size, or
“expressiveness” of H.

3 Linear classifiers
We’ll start with the hypothesis class of linear classifiers. They are (relatively) easy to un-
derstand, simple in a mathematical sense, powerful on their own, and the basis for many
other more sophisticated methods.
A linear classifier in d dimensions is defined by a vector of parameters θ ∈ Rd and
scalar θ0 ∈ R. So, the hypothesis class H of linear classifiers in d dimensions is the set of all
vectors in Rd+1 . We’ll assume that θ is an n × 1 column vector.
Given particular values for θ and θ0 , the classifier is defined by Let’s be careful about
 dimensions. We have
assumed that x and θ
T +1 if θT x + θ0 > 0
h(x; θ, θ0 ) = sign(θ x + θ0 ) = . are both n × 1 column
−1 otherwise vectors. So θT x is 1 × 1,
which in math (but not
Remember that we can think of θ, θ0 as specifying a hyperplane. It divides Rd , the space necessarily numpy) is
the same as a scalar.
our x(i) points live in, into two half-spaces. The one that is on the same side as the normal
vector is the positive half-space, and we classify all points in that space as positive. The
half-space on the other side is negative and all points in it are classified as negative.
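In numpy, this classifier is one line; here is a minimal sketch (our own illustration, using the parameter values that appear in the example below):

    import numpy as np

    def linear_classify(x, theta, theta_0):
        # h(x; theta, theta_0) = sign(theta^T x + theta_0), mapped into {-1, +1}
        return 1 if theta @ x + theta_0 > 0 else -1

    theta, theta_0 = np.array([-1.0, 1.5]), 3.0
    print(linear_classify(np.array([3.0, 2.0]), theta, theta_0))    # +1
    print(linear_classify(np.array([4.0, -1.0]), theta, theta_0))   # -1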


 
Example: Let h be the linear classifier defined by θ = [−1, 1.5]T, θ0 = 3. The diagram below shows several points classified by h. In particular, let x(1) = [3, 2]T and x(2) = [4, −1]T.

    h(x(1); θ, θ0) = sign([−1 1.5][3 2]T + 3) = sign(3) = +1
    h(x(2); θ, θ0) = sign([−1 1.5][4 −1]T + 3) = sign(−2.5) = −1

Thus, x(1) and x(2) are given positive and negative classifications, respectively.

[Figure: the points x(1) and x(2) plotted with the hyperplane θT x + θ0 = 0 and its normal vector.]

Study Question: What is the green vector normal to the hyperplane? Specify it as a column vector.

Study Question: What change would you have to make to θ, θ0 if you wanted to have the separating hyperplane in the same place, but to classify all the points labeled '+' in the diagram as negative and all the points labeled '-' in the diagram as positive?

4 Learning linear classifiers


Now, given a data set and the hypothesis class of linear classifiers, our objective will be to find the linear classifier with the smallest possible training error. This is a well-formed optimization problem. But it's not computationally easy!
We'll start by considering a very simple learning algorithm. (It's a good idea to think of the "stupidest possible" solution to a problem, before trying to get clever. Here's a fairly, but not completely, stupid algorithm.) The idea is to generate k possible hypotheses by generating their parameter vectors at random. Then, we can evaluate the training-set error on each of the hypotheses and return the hypothesis that has the lowest training error (breaking ties arbitrarily).


RANDOM-LINEAR-CLASSIFIER(Dn, k, d)
1  for j = 1 to k
2      randomly sample (θ(j), θ0(j)) from (Rd, R)
3  j* = arg min_{j ∈ {1,...,k}} En(θ(j), θ0(j))
4  return (θ(j*), θ0(j*))

A note about notation: arg min_x f(x) means the value of x for which f(x) is the smallest. Sometimes we write arg min_{x ∈ X} f(x) when we want to explicitly specify the set X of values of x over which we want to minimize.

Study Question: What do you think happens to En(h), where h is the hypothesis returned by RANDOM-LINEAR-CLASSIFIER, as k is increased?

Study Question: What properties of Dn do you think will have an effect on En(h)?
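A minimal numpy sketch of this procedure (our own illustration; the standard normal sampling distribution is our arbitrary choice, since the pseudocode leaves the sampling distribution unspecified):

    import numpy as np

    def random_linear_classifier(X, Y, k, seed=0):
        # X: n x d array of examples, Y: length-n array of labels in {-1, +1}
        rng = np.random.default_rng(seed)
        n, d = X.shape
        best, best_err = None, np.inf
        for _ in range(k):
            theta, theta_0 = rng.normal(size=d), rng.normal()   # arbitrary sampling choice
            err = np.mean(np.sign(X @ theta + theta_0) != Y)    # training error of this hypothesis
            if err < best_err:
                best, best_err = (theta, theta_0), err
        return best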
5 Evaluating a learning algorithm

How should we evaluate the performance of a classifier h? The best method is to measure
test error on data that was not used to train it.
How should we evaluate the performance of a learning algorithm? This is trickier. There
are many potential sources of variability in the possible result of computing test error on a
learned hypothesis h:

• Which particular training examples occurred in Dn

• Which particular testing examples occurred in Dn'

• Randomization inside the learning algorithm itself

Generally, we would like to execute the following process multiple times:

• Train on a new training set

• Evaluate resulting h on a testing set that does not overlap the training set

Doing this multiple times controls for possible poor choices of training set or unfortunate
randomization inside the algorithm itself.
One concern is that we might need a lot of data to do this, and in many applications
data is expensive or difficult to acquire. We can re-use data with cross validation (but it’s
harder to do theoretical analysis).

CROSS-VALIDATE(D, k)
1  divide D into k chunks D1, D2, . . . , Dk (of roughly equal size)
2  for i = 1 to k
3      train hi on D \ Di (withholding chunk Di)
4      compute "test" error Ei(hi) on withheld data Di
5  return (1/k) Σ_{i=1}^{k} Ei(hi)

It’s very important to understand that cross-validation neither delivers nor evaluates a
single particular hypothesis h. It evaluates the algorithm that produces hypotheses.
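A minimal numpy sketch of this procedure (our own illustration; `learn` stands for any learning algorithm that returns a classifier, `error_rate` is the helper sketched earlier, and the initial shuffle is our own addition, not part of the pseudocode):

    import numpy as np

    def cross_validate(X, Y, k, learn, error_rate):
        # shuffle once, then split the data into k roughly equal chunks
        idx = np.random.default_rng(0).permutation(len(X))
        chunks = np.array_split(idx, k)
        errors = []
        for i in range(k):
            held_out = chunks[i]
            train = np.concatenate([chunks[j] for j in range(k) if j != i])
            h = learn(X[train], Y[train])          # train on everything but chunk i
            errors.append(error_rate(h, X[held_out], Y[held_out]))
        return np.mean(errors)                     # average held-out error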


CHAPTER 3

The Perceptron

First of all, the coolest algorithm name! (Well, maybe "neocognitron," also the name of a real ML algorithm, is cooler.) It is based on the 1943 model of neurons made by McCulloch and Pitts and by Hebb. It was developed by Rosenblatt in 1962. At the time, it was not interpreted as attempting to optimize any particular criterion; it was presented directly as an algorithm. There has, since, been a huge amount of study and analysis of its convergence properties and other aspects of its behavior.

1 Algorithm
Recall that we have a training dataset Dn with x ∈ Rd, and y ∈ {−1, +1}. The Perceptron algorithm trains a binary classifier h(x; θ, θ0) using the following algorithm to find θ and θ0 using τ iterative steps: (We use the Greek letter τ here instead of T so we don't confuse it with transpose!)

PERCEPTRON(τ, Dn)
1  θ = [0 0 · · · 0]T
2  θ0 = 0
3  for t = 1 to τ
4      for i = 1 to n
5          if y(i) (θT x(i) + θ0) ≤ 0
6              θ = θ + y(i) x(i)
7              θ0 = θ0 + y(i)
8  return θ, θ0

(Let's check dimensions. Remember that θ is d × 1, x(i) is d × 1, and y(i) is a scalar. Does everything match?)

Intuitively, on each step, if the current hypothesis θ, θ0 classifies example x(i) correctly, then no change is made. If it classifies x(i) incorrectly, then it moves θ, θ0 so that it is "closer" to classifying x(i), y(i) correctly.
Note that if the algorithm ever goes through one iteration of the loop on line 4 without making an update, it will never make any further updates (verify that you believe this!) and so it should just terminate at that point.

Study Question: What is true about En if that happens?
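Here is a minimal numpy sketch of the algorithm, following the pseudocode above (our own illustration, not the official course implementation):

    import numpy as np

    def perceptron(X, Y, tau):
        # X: n x d array of examples, Y: length-n array of labels in {-1, +1}
        n, d = X.shape
        theta, theta_0 = np.zeros(d), 0.0
        for _ in range(tau):
            changed = False
            for i in range(n):
                if Y[i] * (theta @ X[i] + theta_0) <= 0:   # mistake: update
                    theta = theta + Y[i] * X[i]
                    theta_0 = theta_0 + Y[i]
                    changed = True
            if not changed:      # a full pass with no updates: we can stop early
                break
        return theta, theta_0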



Example: Let h be the linear classifier defined by θ(0) = [1, −1]T, θ0(0) = 1. The diagram below shows several points classified by h. However, in this case, h (represented by the bold line) misclassifies the point x(1) = [1, 3]T, which has label y(1) = 1. Indeed,

    y(1) (θ(0)T x(1) + θ0(0)) = [1 −1][1 3]T + 1 = −1 < 0 .

By running an iteration of the Perceptron algorithm, we update

    θ(1) = θ(0) + y(1) x(1) = [2, 2]T
    θ0(1) = θ0(0) + y(1) = 2 .

The new classifier (represented by the dashed line) now correctly classifies each point.

[Figure: the point x(1), the original separator θ(0)T x + θ0(0) = 0 with its normal θ(0) (bold), and the updated separator θ(1)T x + θ0(1) = 0 with its normal θ(1) (dashed).]

A really important fact about the perceptron algorithm is that, if there is a linear classi-
fier with 0 training error, then this algorithm will (eventually) find it! We’ll look at a proof
of this in detail, next.

2 Offset
Sometimes, it can be easier to implement or analyze classifiers of the form

    h(x; θ) = +1 if θT x > 0, and −1 otherwise.

Without an explicit offset term (θ0 ), this separator must pass through the origin, which may
appear to be limiting. However, we can convert any problem involving a linear separator
with offset into one with no offset (but of higher dimension)!


 
Consider the d-dimensional linear separator defined by θ = [θ1 θ2 · · · θd] and offset θ0.

• To each data point x ∈ D, append a coordinate with value +1, yielding

    xnew = [x1 · · · xd +1]T

• Define

    θnew = [θ1 · · · θd θ0]

Then,

    θnewT · xnew = θ1x1 + · · · + θdxd + θ0 · 1 = θT x + θ0 .

Thus, θnew is an equivalent ((d + 1)-dimensional) separator to our original, but with no
offset.
Consider the data set:

X = [[1], [2], [3], [4]]


Y = [[+1], [+1], [−1], [−1]]

It is linearly separable in d = 1 with θ = [−1] and θ0 = 2.5. But it is not linearly separable through the origin! Now, let

    Xnew = [[1, 1], [2, 1], [3, 1], [4, 1]]

This new dataset is separable through the origin, with θnew = [−1, 2.5]T .
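A quick numpy check of this construction (our own illustration):

    import numpy as np

    # Append a constant +1 feature so a through-origin separator can act like one with offset.
    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    Y = np.array([1, 1, -1, -1])
    X_new = np.hstack([X, np.ones((len(X), 1))])   # each row is now [x, 1]

    theta_new = np.array([-1.0, 2.5])
    print(np.sign(X_new @ theta_new))              # [ 1.  1. -1. -1.], matching Y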
We can make a simplified version of the perceptron algorithm if we restrict ourselves to separators through the origin. (We list it here because this is the version of the algorithm we'll study in more detail.)

PERCEPTRON-THROUGH-ORIGIN(τ, Dn)
1  θ = [0 0 · · · 0]T
2  for t = 1 to τ
3      for i = 1 to n
4          if y(i) (θT x(i)) ≤ 0
5              θ = θ + y(i) x(i)
6  return θ

3 Theory of the perceptron


Now, we’ll say something formal about how well the perceptron algorithm really works.
We start by characterizing the set of problems that can be solved perfectly by the perceptron
algorithm, and then prove that, in fact, it can solve these problems. In addition, we provide
a notion of what makes a problem difficult for perceptron and link that notion of difficulty
to the number of iterations the algorithm will take.


3.1 Linear separability


A training set Dn is linearly separable if there exist θ, θ0 such that, for all i = 1, 2, . . . , n:

    y(i) (θT x(i) + θ0) > 0 .

Another way to say this is that all predictions on the training set are correct:

h(x(i) ; θ, θ0 ) = y(i) .

And, another way to say this is that the training error is zero:

En (h) = 0 .

3.2 Convergence theorem

The basic result about the perceptron is that, if the training data Dn is linearly separable, then the perceptron algorithm is guaranteed to find a linear separator. (If the training data is not linearly separable, the algorithm will not be able to tell you for sure, in finite time, that it is not linearly separable. There are other algorithms that can test for linear separability with run-times O(nd/2) or O(d2n) or O(nd−1 log n).)
We will more specifically characterize the linear separability of the dataset by the margin of the separator. We'll start by defining the margin of a point with respect to a hyperplane.
First, recall that the distance of a point x to the hyperplane θ, θ0 is

    (θT x + θ0) / ‖θ‖ .

Then, we'll define the margin of a labeled point (x, y) with respect to hyperplane θ, θ0 to be

    y · (θT x + θ0) / ‖θ‖ .

This quantity will be positive if and only if the point x is classified as y by the linear classifier represented by this hyperplane.

Study Question: What sign does the margin have if the point is incorrectly classified? Be sure you can explain why.

Now, the margin of a dataset Dn with respect to the hyperplane θ, θ0 is the minimum margin of any point with respect to θ, θ0:

    min_i [ y(i) · (θT x(i) + θ0) / ‖θ‖ ] .

The margin is positive if and only if all of the points in the data-set are classified correctly. In that case (only!) it represents the distance from the hyperplane to the closest point.
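A small numpy sketch of these margin computations (our own illustration):

    import numpy as np

    def point_margin(x, y, theta, theta_0):
        # signed margin of a labeled point with respect to the hyperplane (theta, theta_0)
        return y * (theta @ x + theta_0) / np.linalg.norm(theta)

    def dataset_margin(X, Y, theta, theta_0):
        # minimum margin over all labeled points in the data set
        return min(point_margin(x, y, theta, theta_0) for x, y in zip(X, Y))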


 
Example: Let h be the linear classifier defined by θ = [1, −1]T, θ0 = 1. The diagram below shows several points classified by h, one of which is misclassified. We compute the margin for each point:

[Figure: the three labeled points x(1), x(2), x(3) and the hyperplane θT x + θ0 = 0.]

    y(1) · (θT x(1) + θ0) / ‖θ‖ = 1 · (−2 + 1)/√2 = −√2/2
    y(2) · (θT x(2) + θ0) / ‖θ‖ = 1 · (1 + 1)/√2 = √2
    y(3) · (θT x(3) + θ0) / ‖θ‖ = −1 · (−3 + 1)/√2 = √2

Note that since point x(1) is misclassified, its margin is negative. Thus the margin for the whole data set is given by −√2/2.

Theorem 3.1 (Perceptron Convergence). For simplicity, we consider the case where the linear separator must pass through the origin. If the following conditions hold:

(a) there exists θ* such that y(i) (θ*T x(i)) / ‖θ*‖ ≥ γ for all i = 1, . . . , n and for some γ > 0;

(b) all the examples have bounded magnitude: ‖x(i)‖ ≤ R for all i = 1, . . . , n;

then the perceptron algorithm will make at most (R/γ)² mistakes.

Proof. We initialize θ(0) = 0, and let θ(k) define our hyperplane after the perceptron algo-
rithm has made k mistakes. We are going to think about the angle between the hypothesis
we have now, θ(k) and the assumed good separator θ∗ . Since they both go through the ori-
gin, if we can show that the angle between them is decreasing usefully on every iteration,
then we will get close to that separator.


So, let's think about the cosine of the angle between them, and recall, by the definition of the dot product:

    cos(θ(k), θ*) = (θ(k) · θ*) / (‖θ*‖ ‖θ(k)‖)

We'll divide this up into two factors,

    cos(θ(k), θ*) = (θ(k) · θ* / ‖θ*‖) · (1 / ‖θ(k)‖) ,        (3.1)

and start by focusing on the first factor. Without loss of generality, assume that the kth mistake occurs on the ith example (x(i), y(i)). Then

    θ(k) · θ* / ‖θ*‖ = (θ(k−1) + y(i) x(i)) · θ* / ‖θ*‖
                     = θ(k−1) · θ* / ‖θ*‖ + y(i) x(i) · θ* / ‖θ*‖
                     ≥ θ(k−1) · θ* / ‖θ*‖ + γ
                     ≥ kγ ,

where we have first applied the margin condition from (a) and then applied simple induction.
Now, we'll look at the second factor in equation 3.1. We note that since (x(i), y(i)) is classified incorrectly, y(i) (θ(k−1)T x(i)) ≤ 0. Thus,

    ‖θ(k)‖² = ‖θ(k−1) + y(i) x(i)‖²
            = ‖θ(k−1)‖² + 2 y(i) θ(k−1)T x(i) + ‖x(i)‖²
            ≤ ‖θ(k−1)‖² + R²
            ≤ kR² ,

where we have additionally applied the assumption from (b) and then again used simple induction.
Returning to the definition of the dot product, we have

    cos(θ(k), θ*) = (θ(k) · θ* / ‖θ*‖) · (1 / ‖θ(k)‖) ≥ (kγ) · (1 / (√k R)) = √k · (γ / R) .

Since the value of the cosine is at most 1, we have

    1 ≥ √k · (γ / R)
    k ≤ (R / γ)² .

This result endows the margin γ of Dn with an operational meaning: when using the Perceptron algorithm for classification, at most (R/γ)² classification errors will be made, where R is an upper bound on the magnitude of the training vectors.
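As an empirical sanity check (our own illustration, on a toy dataset of our choosing), one can count the mistakes made by the through-origin perceptron on a separable dataset and compare against (R/γ)²:

    import numpy as np

    def perceptron_mistakes(X, Y, passes=100):
        # count mistakes made by the through-origin perceptron
        theta, mistakes = np.zeros(X.shape[1]), 0
        for _ in range(passes):
            for x, y in zip(X, Y):
                if y * (theta @ x) <= 0:
                    theta, mistakes = theta + y * x, mistakes + 1
        return theta, mistakes

    # the toy dataset from the Offset section, with the +1 coordinate appended
    X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
    Y = np.array([1, 1, -1, -1])
    theta, mistakes = perceptron_mistakes(X, Y)
    R = max(np.linalg.norm(x) for x in X)
    theta_star = np.array([-1.0, 2.5])
    gamma = min(y * (theta_star @ x) / np.linalg.norm(theta_star) for x, y in zip(X, Y))
    print(mistakes, "<=", (R / gamma) ** 2)       # the bound is loose but holds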


CHAPTER 4

Feature representation

Linear classifiers are easy to work with and analyze, but they are a very restricted class of
hypotheses. If we have to make a complex distinction in low dimensions, then they are
unhelpful.
Our favorite illustrative example is the "exclusive or" (XOR) data set, the drosophila of machine-learning data sets. (D. melanogaster is a species of fruit fly, used as a simple system in which to study genetics, since 1910.)

There is no linear separator for this two-dimensional dataset! But, we have a trick available: take a low-dimensional data set and move it, using a non-linear transformation, into a higher-dimensional space, and look for a linear separator there. Let's look at an example data set that starts in 1-D:

These points are not linearly separable, but consider the transformation φ(x) = [x, x²]. (What's a linear separator for data in 1D? A point!) Putting the data in φ space, we see that it is now separable. There are lots of possible separators; we have just shown one of them here.


[Figure: the 1-D data mapped by φ into (x, x²) space, with a linear separator shown.]

A linear separator in φ space is a nonlinear separator in the original space! Let’s see
how this plays out in our simple example. Consider the separator x2 − 1 = 0, which labels
the half-plane x2 − 1 > 0 as positive. What separator does it correspond to in the original
1-D space? We have to ask the question: which x values have the property that x2 − 1 = 0.
The answer is +1 and −1, so those two points constitute our separator, back in the original
space. And we can use the same reasoning to find the region of 1D space that is labeled
positive by this separator.

[Figure: the original 1-D space, with the separator points at −1 and +1; the region between them is labeled negative and the regions outside are labeled positive.]

This is a very general and widely useful strategy. It’s the basis for kernel methods, a
powerful technique that we won’t study in this class, and can be seen as a motivation for
multi-layer neural networks.
There are many different ways to construct φ. Some are relatively systematic and do-
main independent. We’ll look at the polynomial basis in section 1 as an example of that.
Others are directly related to the semantics (meaning) of the original features, and we con-
struct them deliberately with our domain in mind. We’ll explore that strategy in section 2.

1 Polynomial basis
If the features in your problem are already naturally numerical, one systematic strategy for
constructing a new feature space is to use a polynomial basis. The idea is that, if you are
using the kth-order basis (where k is a positive integer), you include a feature for every
possible product of k different dimensions in your original input.
Here is a table illustrating the kth order polynomial basis for different values of k.
Order    d = 1                in general
0        [1]                  [1]
1        [1, x]               [1, x1, . . . , xd]
2        [1, x, x²]           [1, x1, . . . , xd, x1², x1x2, . . .]
3        [1, x, x², x³]       [1, x1, . . . , x1², x1x2, . . . , x1x2x3, . . .]
...      ...                  ...
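As a concrete (and unoptimized) sketch, here is one way to generate a kth-order polynomial basis in numpy (our own illustration; for k = 2 and a two-dimensional input it produces exactly the transformation used for XOR below):

    import numpy as np
    from itertools import combinations_with_replacement

    def poly_basis(x, k):
        # all monomials of the entries of x up to total degree k, including the constant 1
        feats = [1.0]
        for degree in range(1, k + 1):
            for idx in combinations_with_replacement(range(len(x)), degree):
                feats.append(np.prod([x[i] for i in idx]))
        return np.array(feats)

    print(poly_basis(np.array([3.0, 2.0]), 2))   # [1, x1, x2, x1^2, x1*x2, x2^2]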
So, what if we try to solve the XOR problem using a polynomial basis as the feature
transformation? We can just take our two-dimensional data and transform it into a higher-

Last Updated: 02/01/19 01:38:02


MIT 6.036 Fall 2018 23

dimensional data set, by applying φ. Now, we have a classification problem as usual, and
we can use the perceptron algorithm to solve it.
Let’s try it for k = 2 on our XOR problem. The feature transformation is

φ((x1, x2)) = (1, x1, x2, x1², x1x2, x2²) .

Study Question: If we use perceptron to train a classifier after performing this fea-
ture transformation, would we lose any expressive power if we let θ0 = 0 (i.e. trained
without offset instead of with offset)?
After 4 iterations, perceptron finds a separator with coefficients θ = (0, 0, 0, 0, 4, 0) and
θ0 = 0. This corresponds to

0 + 0x1 + 0x2 + 0x1² + 4x1x2 + 0x2² + 0 = 0

and is plotted below, with the shaded and un-shaded regions showing the classification
results:

Study Question: Be sure you understand why this high-dimensional hyperplane is


a separator, and how it corresponds to the figure.
For fun, we show some more plots below. Here is the result of running perceptron on XOR, but where the data are put in a different place on the plane. After 65 mistakes (!) it arrives at these coefficients: θ = (1, −1, −1, −5, 11, −5), θ0 = 1, which generate this separator. (The jaggedness in the plotting of the separator is an artifact of a lazy lpk strategy for making these plots; the true curves are smooth.)


Study Question: It takes many more iterations to solve this version. Apply knowl-
edge of the convergence properties of the perceptron to understand why.
Here is a harder data set. After 200 iterations, we could not separate it with a second or
third-order basis representation. Shown below are the results after 200 iterations for bases
of order 2, 3, 4, and 5.

2 Hand-constructing features for real domains


In many machine-learning applications, we are given descriptions of the inputs with many
different types of attributes, including numbers, words, and discrete features. An important factor in the success of an ML application is the way that the features are chosen to be
encoded by the human who is framing the learning problem.

2.1 Discrete features


Getting a good encoding of discrete features is particularly important. You want to create
“opportunities” for the ML system to find the underlying regularities. Although there are
machine-learning methods that have special mechanisms for handling discrete inputs, all
the methods we consider in this class will assume the input vectors x are in Rd . So, we
have to figure out some reasonable strategies for turning discrete values into (vectors of)
real numbers.
We’ll start by listing some encoding strategies, and then work through some examples.
Let’s assume we have some feature in our raw data that can take on one of k discrete values.

• Numeric Assign each of these values a number, say 1.0/k, 2.0/k, . . . , 1.0. We might
want to then do some further processing, as described in section 8.3. This is a sensible
strategy only when the discrete values really do signify some sort of numeric quantity,
so that these numerical values are meaningful.

• Thermometer code If your discrete values have a natural ordering, from 1, . . . , k, but not a natural mapping into real numbers, a good strategy is to use a vector of length k binary variables, where we convert discrete input value 0 < j ≤ k into a vector in which the first j values are 1.0 and the rest are 0.0. This does not necessarily imply anything about the spacing or numerical quantities of the inputs, but does convey something about ordering.

• Factored code If your discrete values can sensibly be decomposed into two parts (say the "make" and "model" of a car), then it's best to treat those as two separate features, and choose an appropriate encoding of each one from this list.

• One-hot code If there is no obvious numeric, ordering, or factorial structure, then the best strategy is to use a vector of length k, where we convert discrete input value 0 < j ≤ k into a vector in which all values are 0.0, except for the jth, which is 1.0. (The thermometer and one-hot encodings are sketched in code just after this list.)

• Binary code It might be tempting for the computer scientists among us to use some
binary code, which would let us represent k values using a vector of length log k.
This is a bad idea! Decoding a binary code takes a lot of work, and by encoding your
inputs this way, you’d be forcing your system to learn the decoding algorithm.
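A minimal sketch of the thermometer and one-hot encodings in numpy (our own illustration):

    import numpy as np

    def one_hot(j, k):
        # value j in 1..k -> length-k vector with a single 1.0 in position j
        v = np.zeros(k)
        v[j - 1] = 1.0
        return v

    def thermometer(j, k):
        # value j in 1..k -> length-k vector whose first j entries are 1.0
        v = np.zeros(k)
        v[:j] = 1.0
        return v

    print(one_hot(3, 5))       # [0. 0. 1. 0. 0.]
    print(thermometer(3, 5))   # [1. 1. 1. 0. 0.]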

As an example, imagine that we want to encode blood types, which are drawn from the set {A+, A−, B+, B−, AB+, AB−, O+, O−}. There is no obvious linear numeric scaling or even ordering to this set. But there is a reasonable factoring, into two features: {A, B, AB, O} and {+, −}. And, in fact, we can reasonably factor the first group into {A, notA}, {B, notB}. (It is sensible, according to Wikipedia!, to treat O as having neither feature A nor feature B.) So, here are two plausible encodings of the whole set:

• Use a 6-D vector, with two dimensions to encode each of the factors using a one-hot encoding.

• Use a 3-D vector, with one dimension for each factor, encoding its presence as 1.0 and absence as −1.0 (this is sometimes better than 0.0). In this case, AB+ would be (1.0, 1.0, 1.0) and O− would be (−1.0, −1.0, −1.0).

Study Question: How would you encode A+ in both of these approaches?


2.2 Text
The problem of taking a text (such as a tweet or a product review, or even this document!)
and encoding it as an input for a machine-learning algorithm is interesting and compli-
cated. Much later in the class, we’ll study sequential input models, where, rather than
having to encode a text as a fixed-length feature vector, we feed it into a hypothesis word
by word (or even character by character!).
There are some simpler encodings that work well for basic applications. One of them is
the bag of words (BOW) model. The idea is to let d be the number of words in our vocabulary
(either computed from the training set or some other body of text or dictionary). We will
then make a binary vector (with values 1.0 and 0.0) of length d, where element j has value
1.0 if word j occurs in the document, and 0.0 otherwise.
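A minimal bag-of-words sketch in Python (our own illustration; a real pipeline would handle tokenization and normalization more carefully):

    import numpy as np

    def bag_of_words(document, vocabulary):
        # binary BOW vector: entry j is 1.0 iff vocabulary word j occurs in the document
        words = set(document.lower().split())
        return np.array([1.0 if w in words else 0.0 for w in vocabulary])

    vocab = ["machine", "learning", "fruit", "fly"]
    print(bag_of_words("Machine learning is fun", vocab))   # [1. 1. 0. 0.]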

2.3 Numeric values


If some feature is already encoded as a numeric value (heart rate, stock price, distance, etc.)
then you should generally keep it as a numeric value. An exception might be a situation in
which you know there are natural “breakpoints” in the semantics: for example, encoding
someone’s age in the US, you might make an explicit distinction between under and over
18 (or 21), depending on what kind of thing you are trying to predict. It might make sense
to divide into discrete bins (possibly spacing them closer together for the very young) and
to use a one-hot encoding for some sorts of medical situations in which we don’t expect a
linear (or even monotonic) relationship between age and some physiological features.
If you choose to leave a feature as numeric, it is typically useful to scale it, so that it tends to be in the range [−1, +1]. Without performing this transformation, if you have one feature with much larger values than another, it will take the learning algorithm a lot of work to find parameters that can put them on an equal basis. So, we might perform the transformation φ(x) = (x − x̄)/σ, where x̄ is the average of the x(i) and σ is the standard deviation of the x(i). The resulting feature values will have mean 0 and standard deviation 1. This transformation is sometimes called standardizing a variable. (Such standard variables are often known as "z-scores," for example, in the social sciences.)
Then, of course, you might apply a higher-order polynomial-basis transformation to one or more groups of numeric features.
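Here is a small sketch of standardizing a numeric feature column in numpy; the example values are our own.

import numpy as np

x = np.array([150.0, 180.0, 120.0, 210.0])   # e.g. raw heart rates
phi = (x - x.mean()) / x.std()               # phi(x) = (x - mean) / std

print(phi.mean(), phi.std())                 # approximately 0.0 and 1.0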
Study Question: Percy Eptron has a domain with 4 numeric input features, (x1, . . . , x4). He decides to use a representation of the form

φ(x) = PolyBasis((x1, x2), 3) ⌢ PolyBasis((x3, x4), 3) ,

where a ⌢ b means the vector a concatenated with the vector b. What is the dimension of Percy's representation? Under what assumptions about the original features is this a reasonable choice?

CHAPTER 5

Margin Maximization

1 Machine learning as optimization


The perceptron algorithm was originally written down directly via cleverness and intu-
ition, and later analyzed theoretically. Another approach to designing machine learning
algorithms is to frame them as optimization problems, and then use standard optimization
algorithms and implementations to actually find the hypothesis.
We begin by writing down an objective function J(θ), where θ stands for all the parameters in our model. We also often write J(θ; D) to make clear the dependence on the data D. The objective function describes how we feel about possible hypotheses θ: we will generally look for values of the parameters θ that minimize the objective function:

θ∗ = arg min_θ J(θ) .

(You can think about θ∗ here as "the theta that minimizes J.")

A very common form for an ML objective is

J(θ) = (1/n) Σ_{i=1}^{n} L(x(i), y(i), θ) + λ R(θ) ,

where the first term is the (average) loss, λ is a constant, and R is the regularizer.
The loss tells us how unhappy we are about the prediction h(x(i); θ) that θ makes for (x(i), y(i)). A common example is the 0-1 loss, introduced in chapter 1:

L01(x, y, θ) = { 0 if y = h(x; θ) ; 1 otherwise }

which gives a value of 0 for a correct prediction and a 1 for an incorrect prediction. In the case of linear separators, this becomes

L01(x, y, θ, θ0) = { 0 if y(θᵀx + θ0) > 0 ; 1 otherwise } .

(We will sometimes write J(θ, θ0) because, when studying linear classifiers, we have used these two names for our whole collection of parameters.)
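To make this concrete, here is a minimal Python sketch of such an objective for a linear classifier, using the 0-1 loss and a squared-norm regularizer; the function names and data layout are our own assumptions.

import numpy as np

def zero_one_loss(x, y, theta, theta_0):
    # 0 for a correct prediction of a +1/-1 label, 1 otherwise
    return 0.0 if y * (theta @ x + theta_0) > 0 else 1.0

def objective(X, Y, theta, theta_0, lam):
    # X is d x n (one column per example); Y holds labels in {+1, -1}
    n = X.shape[1]
    avg_loss = sum(zero_one_loss(X[:, i], Y[i], theta, theta_0) for i in range(n)) / n
    return avg_loss + lam * np.sum(theta ** 2)   # loss term plus lambda * R(theta)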
2 Regularization
If all we cared about was finding a hypothesis with small loss on the training data, we
would have no need for regularization, and could simply omit the second term in the ob-


jective. But remember that our objective is to perform well on input values that we haven’t
trained on! It may seem that this is an impossible task, but humans and machine-learning
methods do this successfully all the time. What allows generalization to new input values
is a belief that there is an underlying regularity that governs both the training and test-
ing data. We have already discussed one way to describe an assumption about such a
regularity, which is by choosing a limited class of possible hypotheses. Another way to
do this is to provide smoother guidance, saying that, within a hypothesis class, we prefer
some hypotheses to others. The regularizer articulates this preference and the constant λ
says how much we are willing to trade off loss on the training data versus preference over
hypotheses.
This trade-off is illustrated in the figure below. Hypothesis h1 has 0 loss, but is very
complicated. Hypothesis h2 mis-classifies two points, but is very simple. In absence of
other beliefs about the solution, it is often better to prefer that the solution be “simpler”,
and so we might prefer h2 over h1 , expecting it to perform better on future examples drawn
from this same distribution. (To establish some vocabulary, we might say that h1 is overfit to the training data.) Another nice way of thinking about regularization is that we would like to prevent our hypothesis from being too dependent on the particular training data that we were given: we would like for it to be the case that if the training data were changed slightly, the hypothesis would not change.

[Figure: a complicated separator h1 with zero training loss, and a simple separator h2 that misclassifies two points.]

A common strategy for specifying a regularizer is to use the form

R(θ) = ‖θ − θprior‖²

when you have some idea in advance that θ ought to be near some value θprior. In the absence of such knowledge, a default is to regularize toward zero:

R(θ) = ‖θ‖² .

(Learn about Bayesian methods in machine learning to see the theory behind this and cool results!)

3 Maximizing the margin


One criterion to use for judging separators that classify all examples correctly (that is, that
have a 0-1 loss value of 0) is to prefer ones that have a large margin. Recall that the margin


of a labeled point (x, y) with respect to the hyperplane θ, θ0 is

γ(x, y, θ, θ0) = y(θᵀx + θ0) / ‖θ‖ .

The margin is positive if a point is classified correctly, and the absolute value of the margin is the perpendicular distance of x to the hyperplane. The margin of a dataset D with respect to θ, θ0 is

min_{(x(i), y(i)) ∈ D} γ(x(i), y(i), θ, θ0) .
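A small numpy sketch of these two quantities (our own helper names):

import numpy as np

def point_margin(x, y, theta, theta_0):
    return y * (theta @ x + theta_0) / np.linalg.norm(theta)

def dataset_margin(X, Y, theta, theta_0):
    # X is d x n (one column per example); Y is a length-n vector of +1/-1 labels
    return min(point_margin(X[:, i], Y[i], theta, theta_0) for i in range(X.shape[1]))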

The figure below illustrates the margin of two linear separators, each with 0 loss, with
respect to the same data set. The separator, h1 , with the larger margin γ1 feels intuitively
like it will have better generalization performance. Separator h2 has a very small margin
and would make errors if the data were perturbed by a small amount. (There is a lot of interesting theoretical work justifying a preference for separators with large margin. There is an important class of algorithms called support vector machines that focus on finding large-margin separators with some other interesting properties. Before the current boom in neural networks, they were the "go-to" algorithms for classification. We don't have time to study them in more detail in this class, but we recommend that you read about them some time.)

[Figure: two zero-loss linear separators h1 and h2 on the same data set, with margins γ1 and γ2.]

If our data is linearly separable, we can frame the problem of finding the maximum-margin separator as

θ∗, θ0∗ = arg max_{θ,θ0} min_i γ(x(i), y(i), θ, θ0) ,

which we can find by minimizing the objective

J(θ, θ0) = − min_i γ(x(i), y(i), θ, θ0) .

However, this form of the objective can be tricky to optimize, and is not useful when the
data are non-separable.
We will develop another possible formulation. Let’s start by assuming that we know a
good value for the target margin, γref > 0. We would like

• All points to have margin > γref and

• the target margin γref to be big.

To make this more precise, we start by defining a new loss function called the hinge loss:

Lh(γ/γref) = { 1 − γ/γref if γ < γref ; 0 otherwise } .

The figure below and to the left plots the hinge loss as a function of the margin, γ, of a point.
It is zero if the margin is greater than γref and increases linearly as the margin decreases.


[Left figure: the hinge loss Lh as a function of the margin γ, zero for γ ≥ γref and rising linearly as γ decreases below γref. Right figure: a separator θ, θ0 with the two margin hyperplanes at distance γref on either side; example points are labeled with losses 0, 0.5, and 1.5.]

Study Question: Plot 0-1 loss on the axes of the figure on the left.
In the figure on the right, we illustrate a hyperplane θ, θ0 . Parallel to that hyperplane,
but offset in either direction by γref are two more hyperplanes (dotted lines in the figure)
representing the margins. Any correctly classified point outside the margin has loss 0. Any
correctly classified point inside the margin has loss in (0, 1). Any incorrectly classified point
has loss > 1.
Study Question: Be sure you can compute the loss values shown on this figure.
Also, try to visualize a plot in 3 dimensions that shows the loss function for positive
points and the loss function for negative points going in the Z dimension (up out of
the page) on top of this plot.
Now, let's formulate an objective function in terms of the parameters (θ, θ0, γref). We will express our desire for a large margin by employing a regularization term in the objective function, namely R(θ, θ0, γref) = 1/γref². Then, we have

J(θ, θ0, γref) = (1/n) Σ_{i=1}^{n} Lh( γ(x(i), y(i), θ, θ0) / γref ) + λ (1/γref²) .

(If you are thinking to yourself that we could express the desire for large margin by setting R = −γref or R = 1/γref or any of a variety of other things, you would be right. We'll insist on this choice for now, because it will turn out to have useful consequences later, but you have no way of seeing that at this point.)
We see that the two terms in the objective function have opposing effects, favoring small and large γref, respectively, with λ governing the trade-off.

Study Question: You should, by now, be asking yourself "How do we pick λ?" You can think of the different objective functions that result from different choices of hyperparameter λ as different learning algorithms. What do you know about choosing among algorithms? How would you use that to pick λ? (Because it's not a parameter of the hypothesis, but rather a parameter of the method we are using to choose the hypothesis, we call λ a hyperparameter.)

Now, in order to slightly simplify our problem and to connect up with more standard existing methods, we are going to make a somewhat unintuitive step. We actually have one more parameter in our set of parameters (θ, θ0, γref) than is really necessary for describing this objective, so we are going to rewrite it. (This is kind of tricky. Stew about it until it makes sense to you.)
Remember that any linear scaling of θ, θ0 represents the same separator. So, without losing any ability to represent separators or describe their margins, we can scale θ, θ0 so that

‖θ‖ = 1/γref .

Note that, because our target margin scales inversely with ‖θ‖, wanting the margin to be large is the same as wanting ‖θ‖ to be small. (It's like we're secretly encoding our target margin as the inverse of the norm of θ.)

With this trick, we don't need γref as a parameter any more, and we get the following objective, which is the objective for support vector machines and which we'll often refer to as the SVM objective:

J(θ, θ0) = (1/n) Σ_{i=1}^{n} Lh( y(i)(θᵀx(i) + θ0) ) + λ‖θ‖² .
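As a minimal sketch (our own function names), the SVM objective is a few lines of numpy:

import numpy as np

def hinge_loss(v):
    return np.where(v < 1, 1 - v, 0.0)

def svm_objective(X, Y, theta, theta_0, lam):
    # X is d x n (one column per example); Y is a length-n vector of +1/-1 labels
    margins = Y * (theta @ X + theta_0)          # y^(i) (theta^T x^(i) + theta_0)
    return np.mean(hinge_loss(margins)) + lam * np.sum(theta ** 2)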

Here are some observations:

1. If λ = 0, for any separator that correctly classifies all the points, θ, θ0 can always be
chosen so that the objective function evaluates to 0.

2. If λ > 0 but is very small, we will pick θ with the smallest ‖θ‖ while still maintaining separation of the data.

3. If λ is large, we tolerate errors in favor of having a "simpler" (smaller norm) separator.

Study Question: Be sure you can give a detailed explanation of each of these points. In point 1 above, would we need to increase or decrease the magnitude of θ to make the objective go to zero? How does point 3 relate to the idea of regularizing toward zero?
At the optimum, for separable data with very small λ:

• y(i)(θᵀx(i) + θ0) ≥ 1 for all i, since the hinge loss evaluates to 0.

• y(i)(θᵀx(i) + θ0) = 1 for at least one i, because the regularizer will drive ‖θ‖ to be as small as possible with the loss still remaining at 0.

Points for which y(i)(θᵀx(i) + θ0) = 1 have margin exactly equal to γref (otherwise we could decrease ‖θ‖ to obtain a larger margin). For these points that lie along the margin, we use their distance to θ, θ0 to compute the margin of this separator with respect to the data set:

y(i)(θᵀx(i) + θ0) / ‖θ‖ = margin ,

and so

margin = 1/‖θ‖ .

(Note that this last assertion, that the margin is the inverse of the norm of θ, is only true under the assumptions listed at the top of this paragraph: when the data is separable, when λ is very small, and when θ is the optimum of the SVM objective.)

Study Question: Be sure you can derive this last step!



CHAPTER 6

Gradient Descent

Now we have shown how to describe an interesting objective function for machine learning, but we need a way to find the optimal θ∗ = arg min_θ J(θ). There is an enormous, fascinating literature on the mathematical and algorithmic foundations of optimization (which you should consider studying some day!), but for this class, we will consider one of the simplest methods, called gradient descent.
Intuitively, in one or two dimensions, we can easily think of J(θ) as defining a surface over θ; that same idea extends to higher dimensions. Now, our objective is to find the θ value at the lowest point on that surface. One way to think about gradient descent is that you start at some arbitrary point on the surface, look to see in which direction the "hill" goes down most steeply, take a small step in that direction, determine the direction of steepest descent from where you are, take another small step, etc. (Here's a very old-school humorous description of gradient descent and other optimization algorithms using analogies involving kangaroos: ftp://ftp.sas.com/pub/neural/kangaroos.txt)

1 One dimension

We will start by considering gradient descent in one dimension. Assume θ ∈ R, and that we know both J(θ) and its first derivative with respect to θ, J′(θ). Here is pseudocode for gradient descent on an arbitrary function f. Along with f and f′, we have to specify the initial value for parameter θ, a step-size parameter η, and an accuracy parameter ε:

1D-GRADIENT-DESCENT(θinit, η, f, f′, ε)

1  θ(0) = θinit
2  t = 0
3  repeat
4      t = t + 1
5      θ(t) = θ(t−1) − η f′(θ(t−1))
6  until |f′(θ(t))| < ε
7  return θ(t)

Note that there are many other reasonable ways to decide to terminate, including when |θ(t) − θ(t−1)| < ε or when |f(θ(t)) − f(θ(t−1))| < ε.
If J is convex, for any desired accuracy ε, there is some step size η such that gradient descent will converge to within ε of the optimal θ. (Woo hoo! We have a convergence guarantee, of sorts.) However, we must be careful when choosing the step size to prevent slow convergence, oscillation around the minimum, or divergence.
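A direct Python translation of the pseudocode above, as a sketch (the cap on iterations is our own safeguard, not part of the algorithm):

def gd_1d(theta_init, eta, f, df, eps, max_iters=10000):
    theta = theta_init
    for _ in range(max_iters):
        theta = theta - eta * df(theta)      # line 5 of the pseudocode
        if abs(df(theta)) < eps:             # line 6: termination test
            break
    return theta

# Example: f(x) = (x - 2)^2, starting at 4.0 with step size 1/2.
print(gd_1d(4.0, 0.5, lambda x: (x - 2)**2, lambda x: 2*(x - 2), 1e-6))   # ~2.0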


The following plot illustrates a convex function f(x) = (x−2)2 , starting gradient descent
at θinit = 4.0 with a step-size of 1/2. It is very well-behaved!
[Plot: the steps of gradient descent on f(x) = (x − 2)², converging toward x = 2.]

Study Question: What happens in this example with very small η? With very big η?
If J is non-convex, where gradient descent converges to depends on θinit. When it reaches a value of θ where f′(θ) = 0 and f″(θ) > 0, but which is not the global minimum of the function, it is called a local minimum or local optimum.

[Plot: a non-convex function with a local minimum that is not the global minimum.]

2 Multiple dimensions
The extension to the case of multi-dimensional θ is straightforward. Let’s assume θ ∈ Rm ,
so J : Rᵐ → R. The gradient of J with respect to θ is

∇θ J = ( ∂J/∂θ1, . . . , ∂J/∂θm )ᵀ .

The algorithm remains the same, except that the update step in line 5 becomes

θ(t) = θ(t−1) − η ∇θ J(θ(t−1))

and we have to change the termination criterion. The easiest thing is to replace the test in line 6 with |f(θ(t)) − f(θ(t−1))| < ε, which is sensible no matter the dimensionality of θ.


3 Application to SVM objective


There are two slight “wrinkles” involved in applying gradient descent to the SVM objec-
tive.
We begin by stating the objective and the gradient necessary for doing gradient descent.
In our problem, the entire parameter vector is described by parameter vector θ and scalar
θ0 and so we will have to adjust them both and compute gradients of J with respect to
each of them. The objective and gradient (note we have replaced the constant λ with λ/2 for convenience) are

J(θ, θ0) = (1/n) Σ_{i=1}^{n} Lh( y(i)(θᵀx(i) + θ0) ) + (λ/2)‖θ‖²

∇θ J = (1/n) Σ_{i=1}^{n} Lh′( y(i)(θᵀx(i) + θ0) ) y(i) x(i) + λθ

(The following step requires passing familiarity with matrix derivatives. A foolproof way of computing them is to compute the partial derivative of J with respect to each component θi of θ.)

Study Question: Convince yourself that the dimensions of all these quantities are
correct, under the assumption that θ is d × 1.
Recall the hinge loss

Lh(v) = { 1 − v if v < 1 ; 0 otherwise } .

This loss is not differentiable, since the derivative at v = 1 doesn't exist! So we consider the subgradient

Lh′(v) = { −1 if v < 1 ; 0 otherwise } .

(You don't have to really understand the idea of a subgradient, just that it has value 0 at v = 1 here.)
This gives us a complete definition of ∇θ J. But we also have to go down the gradient with
respect to θ0 , so we find

∂J/∂θ0 = (1/n) Σ_{i=1}^{n} Lh′( y(i)(θᵀx(i) + θ0) ) y(i) .

Finally, our gradient descent algorithm becomes

SVM-GRADIENT-DESCENT(θinit, θ0init, η, J)

1  θ(0) = θinit
2  θ0(0) = θ0init
3  t = 0
4  repeat
5      t = t + 1
6      θ(t) = θ(t−1) − η [ (1/n) Σ_{i=1}^{n} ( {−1 if y(i)(θ(t−1)ᵀx(i) + θ0(t−1)) < 1 ; 0 otherwise} · y(i) x(i) ) + λθ(t−1) ]
7      θ0(t) = θ0(t−1) − η [ (1/n) Σ_{i=1}^{n} ( {−1 if y(i)(θ(t−1)ᵀx(i) + θ0(t−1)) < 1 ; 0 otherwise} · y(i) ) ]
8  until |J(θ(t), θ0(t)) − J(θ(t−1), θ0(t−1))| < ε
9  return θ(t), θ0(t)
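A numpy sketch of this procedure follows; the function and variable names are ours, and the iteration cap is an assumption for safety.

import numpy as np

def svm_gradient_descent(X, Y, eta, lam, eps=1e-6, max_iters=10000):
    # X is d x n (one example per column); Y is a length-n array of +1/-1 labels
    d, n = X.shape
    theta, theta_0 = np.zeros(d), 0.0
    prev_J = np.inf
    for _ in range(max_iters):
        margins = Y * (theta @ X + theta_0)
        active = (margins < 1).astype(float)              # L_h'(v) = -1 exactly when v < 1
        grad_theta = -(X * (active * Y)).mean(axis=1) + lam * theta
        grad_theta_0 = -(active * Y).mean()
        theta = theta - eta * grad_theta
        theta_0 = theta_0 - eta * grad_theta_0
        # objective with the lambda/2 convention from the text
        J = np.mean(np.maximum(0, 1 - Y * (theta @ X + theta_0))) + (lam / 2) * theta @ theta
        if abs(prev_J - J) < eps:
            break
        prev_J = J
    return theta, theta_0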

Study Question: Is it okay that λ doesn’t appear in line 7?


4 Stochastic Gradient Descent


When the form of the gradient is a sum, rather than take one big(ish) step in the direction of the gradient, we can, instead, randomly select one term of the sum, and take a very small step in that direction. This seems sort of crazy, but remember that all the little steps would average out to the same direction as the big step if you were to stay in one place. Of course, you're not staying in that place, so you move, in expectation, in the direction of the gradient. (The word "stochastic" means probabilistic, or random; so does "aleatoric," which is a very cool word. Look up aleatoric music sometime.)
Most objective functions in machine learning can end up being written as a sum over
data points, in which case, stochastic gradient descent (SGD) is implemented by picking a
data point randomly out of the data set, computing the gradient as if there were only that
one point in the data set, and taking a small step in the negative direction.
Here is pseudocode for applying SGD to ridge regression:

SGD FOR REGRESSION

1  θ(0) = 0
2  for t = 1 to T
3      randomly select i ∈ {1, 2, . . . , n}
4      θ(t+1) = θ(t) − ηt [ (θ(t)ᵀx(i) − y(i)) x(i) + (λ/n) θ(t) ]
5  return θ
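A numpy sketch of the same loop (names and the default step-size rule are our own choices):

import numpy as np

def sgd_ridge(X, Y, lam, T, step_size=lambda t: 1.0 / t):
    # X is d x n (one example per column); Y is a length-n array of real labels
    d, n = X.shape
    theta = np.zeros(d)
    for t in range(1, T + 1):
        i = np.random.randint(n)
        x_i, y_i = X[:, i], Y[i]
        grad = (theta @ x_i - y_i) * x_i + (lam / n) * theta
        theta = theta - step_size(t) * grad
    return theta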

Note that now instead of a fixed value of η, it is indexed by the iteration of the algorithm,
t. For SGD to converge to a local optimum as t increases, the step size has to decrease as a
function of time. To guarantee convergence, the following relationships must hold:

Σ_{t=1}^{∞} ηt = ∞   and   Σ_{t=1}^{∞} ηt² < ∞ .

One “legal” way of setting the step size is to make ηt = 1/t but people often use rules that
decrease more slowly, and so don’t strictly satisfy the criteria for convergence.
Study Question: If you start a long way from the optimum, would making ηt de-
crease more slowly tend to make you move more quickly or more slowly to the opti-
mum?
It can be interesting to re-organize the terms in the gradient update (line 4 of the pseu-
docode) in this way:
 
θ(t+1) = (1 − ληt/n) θ(t) + ηt ( y(i) − θ(t)ᵀx(i) ) x(i) ,

where the first factor shows how λ pressures θ to be small, and the second term involves the error on the current data point.

The first term will tend to move θ toward 0, and the second will tend to reduce the error
on the current data point.



CHAPTER 7

Regression

Now we will turn to a slightly different form of machine-learning problem, called regres-
sion. It is still supervised learning, so our data will still have the form

Sn = ( (x(1), y(1)), . . . , (x(n), y(n)) ) .

("Regression," in common parlance, means moving backwards. But this is forward progress!)
But now, instead of the y values being discrete, they will be real-valued, and so our hy-
potheses will have the form
h : Rd → R .
This is a good framework when we want to predict a numerical quantity, like height, stock
value, etc.
The first step is to pick a loss function, to describe how to evaluate the quality of the pre-
dictions our hypothesis is making, when compared to the “target” y values in the data set.
The choice of loss function is part of modeling your domain. In the absence of additional
information about a regression problem, we typically use squared error (SE):
Loss(guess, actual) = (guess − actual)² .
It penalizes guesses that are too high the same amount as guesses that are too low, and
has a good mathematical justification in the case that your data are generated from an
underlying linear hypothesis, but with Gaussian-distributed noise added to the y values.
We will consider the case of a linear hypothesis class,
h(x; θ, θ0 ) = θT x + θ0 ,
remembering that we can get a rich class of hypotheses by performing a non-linear fea-
ture transformation before doing the regression. So, θT x + θ0 is a linear function of x, but
θT ϕ(x) + θ0 is a non-linear function of x if ϕ is a non-linear function of x.
We will treat regression as an optimization problem, in which, given a data set D, we
wish to find a linear hypothesis that minimizes mean square error. Our objective, often
called mean squared error, is to find values for θ, θ0 that minimize

J(θ, θ0) = (1/n) Σ_{i=1}^{n} ( θᵀx(i) + θ0 − y(i) )² ,

making the solution be


θ∗, θ0∗ = arg min_{θ,θ0} J(θ, θ0) .     (7.1)


1 Analytical solution: ordinary least squares


One very interesting aspect of the problem of finding a linear hypothesis that minimizes mean squared error (this general problem is often called ordinary least squares (OLS)) is that we can find a closed-form formula for the answer! (What does "closed form" mean? Generally, that it involves direct evaluation of a mathematical expression using a fixed number of "typical" operations (like arithmetic operations, trig functions, powers, etc.). So equation 7.1 is not in closed form, because it's not at all clear what operations one needs to perform to find the solution.)
Everything is easier to deal with if we assume that the x(i) have been augmented with an extra input dimension (feature) that always has value 1, so we may ignore θ0. (See chapter 3, section 2 for a reminder about this strategy.) We will use d here for the total number of features in each x(i), including the added 1.
We will approach this just like a minimization problem from calculus homework: take the derivative of J with respect to θ, set it to zero, and solve for θ. There is an additional step required, to check that the resulting θ is a minimum (rather than a maximum or an inflection point) but we won't work through that here. It is possible to approach this problem by:

• finding ∂J/∂θk for k in 1, . . . , d,

• constructing a set of d equations of the form ∂J/∂θk = 0, and

• solving the system for values of θk.
problems, we will work through a more compact (and cool!) matrix view.
Study Question: Work through this and check your answer against ours below.
We can think of our training data in terms of matrices X and Y, where each column of X
is an example, and each “column” of Y is the corresponding label:
 
X = [ x(1) · · · x(n) ]   (a d × n matrix whose ith column is the example x(i)),    Y = [ y(1) · · · y(n) ]   (a 1 × n row of the labels).

In most textbooks, they think of an individual example x(i) as a row, rather than a
column. So that we get an answer that will be recognizable to you, we are going to define a
new matrix and vector, W and T , which are just transposes of our X and Y, and then work
with them:

W = Xᵀ   (an n × d matrix whose ith row is (x(i))ᵀ),    T = Yᵀ   (an n × 1 column vector with entries y(1), . . . , y(n)).
Now we can write

J(θ) = (1/n) (Wθ − T)ᵀ (Wθ − T)

(where (Wθ − T)ᵀ is 1 × n and (Wθ − T) is n × 1), and using facts about matrix/vector calculus, we get

∇θ J = (2/n) Wᵀ (Wθ − T)

(where Wᵀ is d × n and (Wθ − T) is n × 1).


Setting to 0 and solving, we get:

(2/n) Wᵀ(Wθ − T) = 0
WᵀWθ − WᵀT = 0
WᵀWθ = WᵀT
θ = (WᵀW)⁻¹ WᵀT

And the dimensions work out: (WᵀW)⁻¹ is d × d, Wᵀ is d × n, and T is n × 1, so θ is d × 1.

So, given our data, we can directly compute the linear regression that minimizes mean
squared error. That’s pretty awesome!
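As a sketch, the closed form is one line of numpy (the random data here is only for illustration; using np.linalg.solve instead of an explicit inverse is a standard numerical choice):

import numpy as np

n, d = 100, 4
W = np.random.randn(n, d)    # one example per row, including an augmented 1 feature if desired
T = np.random.randn(n, 1)    # labels as an n x 1 column

theta = np.linalg.solve(W.T @ W, W.T @ T)   # solves (W^T W) theta = W^T T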

2 Regularization

Well, actually, there are some kinds of trouble we can get into. What if W T W is not
invertible?
Study Question: Consider, for example, a situation where the data-set is just the
same point repeated twice: x(1) = x(2) = (1, 2)T . What is W in this case? What is
W T W? What is (W T W)−1 ?
Another kind of problem is overfitting: we have formulated an objective that is just
about fitting the data as well as possible, but as we discussed in the context of margin
maximization, we might also want to regularize to keep the hypothesis from getting too
attached to the data.
We address both the problem of W ᵀW not being invertible and the problem of overfitting using a mechanism called ridge regression. We add a regularization term ‖θ‖² to the OLS objective, with a trade-off parameter λ.
Study Question: When we add a regularizer of the form ‖θ‖², what is our most "preferred" value of θ, in the absence of any data?

Jridge(θ, θ0) = (1/n) Σ_{i=1}^{n} ( θᵀx(i) + θ0 − y(i) )² + λ‖θ‖²

Larger λ values pressure θ values to be near zero. Note that we don’t penalize θ0 ; intu-
itively, θ0 is what “floats” the regression surface to the right level for the data you have,
and so you shouldn’t make it harder to fit a data set where the y values tend to be around
one million than one where they tend to be around one. The other parameters control the
orientation of the regression surface, and we prefer it to have a not-too-crazy orientation.
There is an analytical expression for the θ, θ0 values that minimize Jridge , but it’s a little
bit more complicated to derive than the solution for OLS because θ0 needs special treatment.
If we decide not to treat θ0 specially (so we add a 1 feature to our input vectors), then we
get:
∇θ Jridge = (2/n) Wᵀ(Wθ − T) + 2λθ .


Setting to 0 and solving, we get:


(2/n) Wᵀ(Wθ − T) + 2λθ = 0
(1/n) WᵀWθ − (1/n) WᵀT + λθ = 0
(1/n) WᵀWθ + λθ = (1/n) WᵀT
WᵀWθ + nλθ = WᵀT
(WᵀW + nλI)θ = WᵀT
θ = (WᵀW + nλI)⁻¹ WᵀT

Whew! So,

θridge = (WᵀW + nλI)⁻¹ WᵀT ,

which becomes invertible when λ > 0. (This is called "ridge" regression because we are adding a "ridge" of λ values along the diagonal of the matrix before inverting it.)

Study Question: Derive the ridge regression formula.
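As with OLS, this is a one-liner in numpy; the sketch below reuses the W, T conventions above (the function name is ours).

import numpy as np

def ridge_fit(W, T, lam):
    n, d = W.shape
    return np.linalg.solve(W.T @ W + n * lam * np.eye(d), W.T @ T)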
Talking about regularization  In machine learning in general, not just regression, it is
useful to distinguish two ways in which a hypothesis h ∈ H might engender errors on test
data. We have
Structural error: This is error that arises because there is no hypothesis h ∈ H that will perform well on the data, for example because the data was really generated by a sine wave but we are trying to fit it with a line.

Estimation error: This is error that arises because we do not have enough data (or the
data are in some way unhelpful) to allow us to choose a good h ∈ H.
When we increase λ, we tend to increase structural error but decrease estimation error, and vice versa. (There are technical definitions of these concepts that are studied in more advanced treatments of machine learning. Structural error is referred to as bias and estimation error is referred to as variance.)

Study Question: Consider using a polynomial basis of order k as a feature transformation φ on your data. Would increasing k tend to increase or decrease structural error? What about estimation error?
3 Optimization via gradient descent
Inverting the d × d matrix WᵀW takes O(d³) time, which makes the analytic solution impractical for large d. If we have high-dimensional data, we can fall back on gradient descent. (Well, actually, Gauss-Jordan elimination, a popular algorithm, takes O(d³) arithmetic operations, but the bit complexity of the intermediate results can grow exponentially! There are other algorithms with polynomial bit complexity. If this just made no sense to you, don't worry.)

Study Question: Why is having large n not as much of a computational problem as having large d?

Recall the ridge objective

Jridge(θ, θ0) = (1/n) Σ_{i=1}^{n} ( θᵀx(i) + θ0 − y(i) )² + λ‖θ‖²
and its gradient with respect to θ

∇θ J = (2/n) Σ_{i=1}^{n} ( θᵀx(i) + θ0 − y(i) ) x(i) + 2λθ

and partial derivative with respect to θ0

∂J/∂θ0 = (2/n) Σ_{i=1}^{n} ( θᵀx(i) + θ0 − y(i) ) .

Armed with these derivatives, we can do gradient descent, using the regular or stochastic
gradient methods from chapter 6.
Even better, the objective functions for OLS and ridge regression are convex, which
means they have only one minimum, which means, with a small enough step size, gra-
dient descent is guaranteed to find the optimum.



CHAPTER 8

Neural Networks

Unless you live under a rock with no internet access, you’ve been hearing a lot about “neu-
ral networks.” Now that we have several useful machine-learning concepts (hypothesis
classes, classification, regression, gradient descent, regularization, etc.) we are completely
well equipped to understand neural networks in detail.
This is, in some sense, the “third wave” of neural nets. The basic idea is founded on
the 1943 model of neurons of McCulloch and Pitts and learning ideas of Hebb. There
was a great deal of excitement, but not a lot of practical success: there were good train-
ing methods (e.g., perceptron) for linear functions, and interesting examples of non-linear
functions, but no good way to train non-linear functions from data. Interest died out for a
while, but was re-kindled in the 1980s when several people came up with a way to train neural networks with "back-propagation," which is a particular style of implementing gradient descent, which we will study here. (As with many good ideas in science, the basic idea for how to train non-linear neural networks with gradient descent was independently developed by more than one researcher.) By the mid-90s, the enthusiasm waned again, because although we could train non-linear networks, the training tended to be slow and was plagued by a problem of getting stuck in local optima. Support vector machines (SVMs) (regularization of high-dimensional hypotheses by seeking to maximize the margin) and kernel methods (an efficient and beautiful way of using feature transformations to non-linearly transform data into a higher-dimensional space) provided reliable learning methods with guaranteed convergence and no local optima.
methods with guaranteed convergence and no local optima.
However, during the SVM enthusiasm, several groups kept working on neural net-
works, and their work, in combination with an increase in available data and computation,
has made them rise again. They have become much more reliable and capable, and are
now the method of choice in many applications. There are many, many variations of neural networks, which we can't even begin to survey (the number increases daily, as may be seen on arxiv.org). We will study the core "feed-forward" networks with "back-propagation" training, and then, in later chapters, address some of
the major advances beyond this core.
We can view neural networks from several different perspectives:

View 1: An application of stochastic gradient descent for classification and regression


with a potentially very rich hypothesis class.

View 2: A brain-inspired network of neuron-like computing elements that learn dis-


tributed representations.

View 3: A method for building applications that make predictions based on huge amounts


of data in very complex domains.

We will mostly take view 1, with the understanding that the techniques we develop will
enable the applications in view 3. View 2 was a major motivation for the early development
of neural networks, but the techniques we will study do not seem to actually account for the biological learning processes in brains. (Some prominent researchers are, in fact, working hard to find analogues of these methods in the brain.)
1 Basic element
The basic element of a neural network is a “neuron,” pictured schematically below. We will
also sometimes refer to a neuron as a “unit” or “node.”
[Diagram: a neuron with inputs x1, . . . , xm weighted by w1, . . . , wm plus offset w0, producing the pre-activation z, which is passed through f(·) to give the activation (output) a.]

It is a non-linear function of an input vector x ∈ Rᵐ to a single output value a ∈ R. (Sorry for changing our notation here. We were using d as the dimension of the input, but we are trying to be consistent here with many other accounts of neural networks. It is impossible to be consistent with all of them though; there are many different ways of telling this story.) It is parameterized by a vector of weights (w1, . . . , wm) ∈ Rᵐ and an offset or threshold w0 ∈ R. In order for the neuron to be non-linear, we also specify an activation function f : R → R, which can be the identity (f(x) = x), but can also be any other function, though we will only be able to work with it if it is differentiable.
The function represented by the neuron is expressed as:

a = f(z) = f( Σ_{j=1}^{m} xj wj + w0 ) = f(wᵀx + w0) .
Before thinking about a whole network, we can consider how to train a single unit.
Given a loss function L(guess, actual) and a dataset {(x(1), y(1)), . . . , (x(n), y(n))}, we can do (stochastic) gradient descent, adjusting the weights w, w0 (which should remind you of our θ and θ0 for linear models) to minimize

Σᵢ L( NN(x(i); w, w0), y(i) ) ,

where NN is the output of our neural net for a given input.


We have already studied two special cases of the neuron: linear classifiers with hinge
loss and regressors with quadratic loss! Both of these have activation functions f(x) = x.
Study Question: Just for a single neuron, imagine for some reason, that we decide
to use activation function f(z) = ez and loss function L(g, a) = (g − a)2 . Derive a
gradient descent update for w and w0 .

2 Networks
Now, we’ll put multiple neurons together into a network. A neural network in general
takes in an input x ∈ Rm and generates an output a ∈ Rn . It is constructed out of multiple


neurons; the inputs of each neuron might be elements of x and/or outputs of other neurons.
The outputs are generated by n output units.
In this chapter, we will only consider feed-forward networks. In a feed-forward network,
you can think of the network as defining a function-call graph that is acyclic: that is, the
input to a neuron can never depend on that neuron’s output. Data flows, one way, from
the inputs to the outputs, and the function computed by the network is just a composition
of the functions computed by the individual neurons.
Although the graph structure of a neural network can really be anything (as long as it
satisfies the feed-forward constraint), for simplicity in software and analysis, we usually
organize them into layers. A layer is a group of neurons that are essentially “in parallel”:
their inputs are outputs of neurons in the previous layer, and their outputs are the input to
the neurons in the next layer. We’ll start by describing a single layer, and then go on to the
case of multiple layers.

2.1 Single layer


A layer is a set of units that, as we have just described, are not connected to each other. The
layer is called fully connected if, as in the diagram below, the inputs to each unit in the layer
are the same (i.e. x1 , x2 , . . . xm in this case). A layer has input x ∈ Rm and output (also
known as activation) a ∈ Rn .

[Diagram: a fully connected layer with inputs x1, . . . , xm, weights W and offsets W0, and n units producing activations a1, . . . , an.]
Since each unit has a vector of weights and a single offset, we can think of the weights of
the whole layer as a matrix, W, and the collection of all the offsets as a vector W0 . If we
have m inputs, n units, and n outputs, then

• W is an m × n matrix,

• W0 is an n × 1 column vector,

• X, the input, is an m × 1 column vector,

• Z, the pre-activation, is an n × 1 column vector,

• A, the activation, is an n × 1 column vector,

and we can represent the output vector as follows

A = f(Z) = f(W T X + W0 ) .


The activation function f is applied element-wise to the pre-activation values Z.
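A sketch of a single layer's forward pass, following the shape conventions above (the function name and use of tanh are our own choices):

import numpy as np

def layer_forward(X, W, W0, f=np.tanh):
    Z = W.T @ X + W0        # pre-activation, n x 1
    A = f(Z)                # activation, applied element-wise
    return A

m, n = 3, 4
X = np.random.randn(m, 1)
W, W0 = np.random.randn(m, n), np.zeros((n, 1))
print(layer_forward(X, W, W0).shape)   # (4, 1)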


What can we do with a single layer? We have already seen single-layer networks, in the
form of linear separators and linear regressors. All we can do with a single layer is make
a linear hypothesis (with some possible linear transformation on the output). The whole
reason for moving to neural networks is to move in the direction of non-linear hypotheses.
To do this, we will have to consider multiple layers.

2.2 Many layers


A single neural network generally combines multiple layers, most typically by feeding the
outputs of one layer into the inputs of another
We have to start by establishing some nomenclature. We will use l to name a layer, and
let ml be the number of inputs to the layer and nl be the number of outputs from the layer.
Then, W l and W0l are of shape ml × nl and nl × 1, respectively. Let fl be the activation
function of layer l. (It is technically possible to have different activation functions within the same layer, but, again, for convenience in specification and implementation, we generally have the same activation function within a layer.) Then, the pre-activation outputs are the nl × 1 vector

Zl = (W l)ᵀ Al−1 + W0l

and the activated outputs are simply the nl × 1 vector

Al = fl(Zl) .

Here's a diagram of a many-layered network, with two blocks for each layer, one rep-
resenting the linear part of the operation and one representing the non-linear activation
function. We will use this structural decomposition to organize our algorithmic thinking
and implementation.

[Diagram: X = A0 → (W1, W01) → Z1 → f1 → A1 → (W2, W02) → Z2 → f2 → A2 → · · · → AL−1 → (WL, W0L) → ZL → fL → AL, where layer l consists of the linear block (Wl, W0l) followed by the activation fl.]

3 Choices of activation function


There are many possible choices for the activation function. We will start by thinking about
whether it’s really necessary to have an f at all.
What happens if we let f be the identity? Then, in a network with L layers (we’ll leave
out W0 for simplicity, but it won’t change the form of this argument),
AL = (WL)ᵀ AL−1 = (WL)ᵀ (WL−1)ᵀ · · · (W1)ᵀ X .

So, multiplying out the weight matrices, we find that

AL = W total X ,

which is a linear function of X! Having all those layers did not change the representational
capacity of the network: the non-linearity of the activation function is crucial.
Study Question: Convince yourself that any function representable by any number
of linear layers (where f is the identity function) can be represented by a single layer.
Now that we are convinced we need a non-linear activation, let’s examine a few com-
mon choices.


Step function

step(z) = { 0 if z < 0 ; 1 otherwise }

Rectified linear unit

ReLU(z) = { 0 if z < 0 ; z otherwise } = max(0, z)

Sigmoid function  Also known as a logistic function; can be interpreted as a probability, because for any value of z the output is in [0, 1]:

σ(z) = 1 / (1 + e^{−z})

Hyperbolic tangent  Always in the range [−1, 1]:

tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})

Softmax function  Takes a whole vector Z ∈ Rⁿ and generates as output a vector A ∈ [0, 1]ⁿ with the property that Σ_{i=1}^{n} Ai = 1, which means we can interpret it as a probability distribution over n items:

softmax(z) = ( exp(z1)/Σᵢ exp(zi), . . . , exp(zn)/Σᵢ exp(zi) )ᵀ

[Plots: step(z), ReLU(z), σ(z), and tanh(z) as functions of z.]
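Numpy sketches of these activations (function names are ours; subtracting max(z) in softmax is a standard numerical-stability trick, not something the notes require):

import numpy as np

def step(z):    return np.where(z < 0, 0.0, 1.0)
def relu(z):    return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def tanh(z):    return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()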


The original idea for neural networks involved using the step function as an activa-
tion, but because the derivative is discontinuous, we won’t be able to use gradient-descent
methods to tune the weights in a network with step functions, so we won’t consider them
further. They have been replaced, in a sense, by the sigmoid, relu, and tanh activation
functions.
Study Question: Consider sigmoid, relu, and tanh activations. Which one is most
like a step function? Is there an additional parameter you could add to a sigmoid
that would make it be more like a step function?

Study Question: What is the derivative of the relu function? Are there some values
of the input for which the derivative vanishes?
ReLUs are especially common in internal (“hidden”) layers, and sigmoid activations are
common for the output for binary classification and softmax for multi-class classification
(see section 6.1 for an explanation).

4 Error back-propagation
We will train neural networks using gradient descent methods. It’s possible to use batch
gradient descent, in which we sum up the gradient over all the points (as in section 2 of
chapter 6) or stochastic gradient descent (SGD), in which we take a small step with respect
to the gradient considering a single point at a time (as in section 4 of chapter 6).
Our notation is going to get pretty hairy pretty quickly. To keep it as simple as we can,
we’ll focus on taking the gradient with respect to a single point, for SGD; you can simply
sum up these gradients over all the data points if you wish to do batch descent.
So, to do SGD for a training example (x, y), we need to compute ∇W Loss(NN(x; W), y),
where W represents all weights W l , W0l in all the layers l = (1, . . . , L). This seems terrifying,
but is actually quite easy to do using the chain rule. (Remember the chain rule! If a = f(b) and b = g(c), so that a = f(g(c)), then da/dc = (da/db)·(db/dc) = f′(b)g′(c) = f′(g(c))g′(c).)
Remember that we are always computing the gradient of the loss function with respect to the weights for a particular value of (x, y). That tells us how much we want to change the weights, in order to reduce the loss incurred on this particular training example.
First, let's see how the loss depends on the weights in the final layer, WL. Remembering that our output is AL, and using the shorthand loss to stand for Loss(NN(x; W), y), which is equal to Loss(AL, y), and finally that AL = fL(ZL) and ZL = (WL)ᵀAL−1, we can use the chain rule:

∂loss/∂WL = (∂loss/∂AL) · (∂AL/∂ZL) · (∂ZL/∂WL) ,

where the first factor depends on the loss function, the second is fL′, and the third is AL−1.
To actually get the dimensions to match, we need to write this a bit more carefully, and note that it is true for any l, including l = L:

∂loss/∂W l = Al−1 (∂loss/∂Zl)ᵀ     (8.1)

where ∂loss/∂W l is ml × nl, Al−1 is ml × 1, and (∂loss/∂Zl)ᵀ is 1 × nl. (It might reasonably bother you that ∂ZL/∂WL = AL−1. We're somehow thinking about the derivative of a vector with respect to a matrix, which seems like it might need to be a three-dimensional thing. But note that ∂ZL/∂WL is really ∂((WL)ᵀAL−1)/∂WL and it seems okay in at least an informal sense that it's AL−1.)
Yay! So, in order to find the gradient of the loss with respect to the weights in the other layers of the network, we just need to be able to find ∂loss/∂Zl.
If we repeatedly apply the chain rule, we get this expression for the gradient of the loss


with respect to the pre-activation in the first layer:


∂loss/∂Z1 = (∂loss/∂AL) · (∂AL/∂ZL) · (∂ZL/∂AL−1) · · · (∂A2/∂Z2) · (∂Z2/∂A1) · (∂A1/∂Z1) .     (8.2)

(The product of the factors up through ∂A2/∂Z2 is ∂loss/∂Z2, and including ∂Z2/∂A1 as well gives ∂loss/∂A1.)

This derivation was informal, to show you the general structure of the computation. In
fact, to get the dimensions to all work out, we just have to write it backwards! Let’s first
understand more about these quantities:
• ∂loss/∂AL is nL × 1 and depends on the particular loss function you are using.
• ∂Zl/∂Al−1 is ml × nl and is just W l (you can verify this by computing a single entry ∂Zli/∂Al−1j).

• ∂Al/∂Zl is nl × nl. It's a little tricky to think about. Each element ali = fl(zli). This means that ∂ali/∂zlj = 0 whenever i ≠ j. So, the off-diagonal elements of ∂Al/∂Zl are all 0, and the diagonal elements are ∂ali/∂zli = fl′(zli).
Now, we can rewrite equation 8.2 so that the quantities match up as
∂loss/∂Zl = (∂Al/∂Zl) · W l+1 · (∂Al+1/∂Zl+1) · . . . · W L−1 · (∂AL−1/∂ZL−1) · W L · (∂AL/∂ZL) · (∂loss/∂AL) .     (8.3)
Using equation 8.3 to compute ∂loss/∂Zl, combined with equation 8.1, lets us find the
gradient of the loss with respect to any of the weight matrices.
Study Question: Apply the same reasoning to find the gradients of loss with respect
to W0l .
This general process is called error back-propagation. The idea is that we first do a forward
pass to compute all the a and z values at all the layers, and finally the actual loss on this
example. Then, we can work backward and compute the gradient of the loss with respect
to the weights in each layer, starting at layer L and going back to layer 1. I like to think of this as
“blame propagation”.
y You can think of loss
as how mad we are
about the prediction
X = A0 W 1 Z1 A1 W2 Z2 A2 AL−1 W L ZL AL that the network just
f1 f2 ··· fL Loss
W01 W02 W0L made. Then ∂loss/∂AL
is how much we blame
∂loss ∂loss ∂loss ∂loss ∂loss ∂loss ∂loss AL for the loss. The last
∂Z1 ∂A1 ∂Z2 ∂A2 ∂AL−1 ∂ZL ∂AL
module has to take in
∂loss/∂AL and com-
If we view our neural network as a sequential composition of modules (in our work pute ∂loss/∂ZL , which
so far, it has been an alternation between a linear transformation with a weight matrix, is how much we blame
and a component-wise application of a non-linear activation function), then we can define ZL for the loss. The
next module (work-
a simple API for a module that will let us compute the forward and backward passes, as ing backwards) takes
well as do the necessary weight updates for gradient descent. Each module has to provide in ∂loss/∂ZL and com-
the following “methods.” We are already using letters a, x, y, z with particular meanings, putes ∂loss/∂AL−1 . So
so here we will use u as the vector input to the module and v as the vector output: every module is accept-
ing its blame for the
• forward: u → v loss, computing how
much of it to allocate to
• backward: u, v, ∂L/∂v → ∂L/∂u each of its inputs, and
passing the blame back
• weight grad: u, ∂L/∂v → ∂L/∂W only needed for modules that have weights W to them.

In homework we will ask you to implement these modules for neural network components,
and then use them to construct a network and train it as described in the next section.
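As a rough sketch of what such modules might look like (the class and method names here are our own; the homework's interface may differ):

import numpy as np

class Linear:
    def __init__(self, m, n):
        self.W = np.random.randn(m, n) / np.sqrt(m)   # small random init, as in the training section
        self.W0 = np.zeros((n, 1))

    def forward(self, u):                  # u: m x 1  ->  v = W^T u + W0 : n x 1
        self.u = u
        return self.W.T @ u + self.W0

    def backward(self, dL_dv):             # dL/dv: n x 1  ->  dL/du: m x 1
        return self.W @ dL_dv

    def weight_grad(self, dL_dv):          # equation 8.1: dL/dW = u (dL/dv)^T ; dL/dW0 = dL/dv
        return self.u @ dL_dv.T, dL_dv

class Tanh:
    def forward(self, u):
        self.v = np.tanh(u)
        return self.v

    def backward(self, dL_dv):             # dA/dZ is diagonal, so this is element-wise
        return (1.0 - self.v ** 2) * dL_dv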


5 Training
Here we go! Here’s how to do stochastic gradient descent training on a feed-forward neural
network. After this pseudo-code, we motivate the choice of initialization in lines 2 and 3.
The actual computation of the gradient values (e.g. ∂loss/∂AL ) is not directly defined in
this code, because we want to make the structure of the computation clear.
Study Question: What is ∂Zl /∂W l ?

Study Question: Which terms in the code below depend on fL ?

TRAIN-NEURAL-NET(Dn, T, L, (m1, . . . , mL), (f1, . . . , fL))

1  for l = 1 to L
2      Wlij ∼ Gaussian(0, 1/ml)
3      Wl0j ∼ Gaussian(0, 1)

4 for t = 1 to T
5 i = random sample from {1, . . . , n}
6 A0 = x(i)
7 // forward pass to compute the output AL
8 for l = 1 to L
9 Zl = W lT Al−1 + W0l
10 Al = fl (Zl )
11 loss = Loss(AL , y(i) )
12 for l = L to 1:
13 // error back-propagation
14 ∂loss/∂Al = if l < L then ∂loss/∂Zl+1 · ∂Zl+1 /∂Al else ∂loss/∂AL
15 ∂loss/∂Zl = ∂loss/∂Al · ∂Al /∂Zl
16 // compute gradient with respect to weights
17 ∂loss/∂W l = ∂loss/∂Zl · ∂Zl /∂W l
18 ∂loss/∂W0l = ∂loss/∂Zl · ∂Zl /∂W0l
19 // stochastic gradient descent update
20 W l = W l − η(t) · ∂loss/∂W l
21 W0l = W0l − η(t) · ∂loss/∂W0l

Initializing W is important; if you do it badly there is a good chance the neural network
training won’t work well. First, it is important to initialize the weights to random val-
ues. We want different parts of the network to tend to “address” different aspects of the
problem; if they all start at the same weights, the symmetry will often keep the values
from moving in useful directions. Second, many of our activation functions have (near)
zero slope when the pre-activation z values have large magnitude, so we generally want to
keep the initial weights small so we will be in a situation where the gradients are non-zero,
so that gradient descent will have some useful signal about which way to go.
One good general-purpose strategy is to choose each weight at random from a Gaussian
(normal) distribution with mean 0 and standard deviation (1/m) where m is the number
of inputs to the unit.
Study Question: If the input x to this unit is a vector of 1’s, what would the ex-
pected pre-activation z value be with these initial weights?
We write this choice (where ∼ means “is drawn randomly from the distribution”)

Wlij ∼ Gaussian(0, 1/ml) .


It will often turn out (especially for fancier activations and loss functions) that computing ∂loss/∂ZL is easier than computing ∂loss/∂AL and ∂AL/∂ZL.
So, we may instead ask for an implementation of a loss function to provide a backward
method that computes ∂loss/∂ZL directly.

6 Loss functions and activation functions


Different loss functions make different assumptions about the range of inputs they will get
as input and, as we have seen, different activation functions will produce output values in
different ranges. When you are designing a neural network, it’s important to make these
things fit together well. In particular, we will think about matching loss functions with the
activation function in the last layer, fL . Here is a table of loss functions and activations that
make sense for them:
Loss fL
squared linear
hinge linear
NLL sigmoid
NLLM softmax

But what is NLL?

6.1 Two-class classification and log likelihood


For classification, the natural loss function is 0-1 loss, but we have already discussed the
fact that it’s very inconvenient for gradient-based learning because its derivative is discon-
tinuous. Hinge loss gives us a way, for binary classification problems, to make a smoother
objective. An alternative loss function that has a nice probabilistic interpretation, is in pop-
ular use, and extends nicely to multi-class classification is called negative log likelihood (NLL).
We will discuss it first in the two-class case, and then generalize to multiple classes.
Let’s assume that the activation function on the output layer is a sigmoid and that there
is a single unit in the output layer, so the output of the whole neural network is a scalar,
aL . Because fL is a sigmoid, we know aL ∈ [0, 1], and we can interpret it as the probability
that the input x is a positive example. Let us further assume that the labels in the training
data are y ∈ {0, 1}, so they can also be interpreted as probabilities.
We might want to pick the parameters of our network to maximize the probability that
the network assigns the correct labels to all the points. That would be

∏_{i=1}^{n} { a(i) if y(i) = 1 ; 1 − a(i) otherwise } ,

under the assumption that our predictions are independent. This can be cleverly rewritten as

∏_{i=1}^{n} (a(i))^{y(i)} (1 − a(i))^{1−y(i)} .
i=1


Study Question: Be sure you can see why these two expressions are the same.
Now, because products are kind of hard to deal with, and because the log function is
monotonic, the W that maximizes the log of this quantity will be the same as the W that
maximizes the original, so we can try to maximize

Σ_{i=1}^{n} [ y(i) log a(i) + (1 − y(i)) log(1 − a(i)) ] ,

which we can write in terms of a loss function

Σ_{i=1}^{n} Lnll(a(i), y(i))

where Lnll is the negative log likelihood loss function:

Lnll (guess, actual) = − (actual · log(guess) + (1 − actual) · log(1 − guess)) .

This loss function is also sometimes referred to as the log loss or cross entropy. (You can use any base for the logarithm and it won't make any real difference. If we ask you for numbers, use log base e.)

6.2 Multi-class classification and log likelihood

We can extend this idea directly to multi-class classification with K classes, where the training label is represented with the one-hot vector y = (y1, . . . , yK)ᵀ, where yk = 1 if the example is of class k. Assume that our network uses softmax as the activation function in the last layer, so that the output is a = (a1, . . . , aK)ᵀ, which represents a probability distribution over the K possible classes. Then, the probability that our network predicts the correct class for this example is ∏_{k=1}^{K} ak^{yk} and the log of the probability that it is correct is Σ_{k=1}^{K} yk log ak, so

Lnllm(guess, actual) = − Σ_{k=1}^{K} actualk · log(guessk) .

We’ll call this NLLM for negative log likelihood multiclass.
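Both losses are short numpy sketches (the function names are ours):

import numpy as np

def nll(guess, actual):
    # binary case: guess in (0, 1), actual in {0, 1}
    return -(actual * np.log(guess) + (1 - actual) * np.log(1 - guess))

def nllm(guess, actual):
    # multi-class case: guess is a probability vector, actual is one-hot
    return -np.sum(actual * np.log(guess))

print(nllm(np.array([0.7, 0.2, 0.1]), np.array([1.0, 0.0, 0.0])))   # -log(0.7)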


Study Question: Show that Lnllm for K = 2 is the same as Lnll .


7 Optimizing neural network parameters


Because neural networks are just parametric functions, we can optimize loss with respect to
the parameters using standard gradient-descent software, but we can take advantage of the
structure of the loss function and the hypothesis class to improve optimization. As we have
seen, the modular function-composition structure of a neural network hypothesis makes it
easy to organize the computation of the gradient. As we have also seen earlier, the structure
of the loss function as a sum over terms, one per training data point, allows us to consider
stochastic gradient methods. In this section we’ll consider some alternative strategies for
organizing training, and also for making it easier to handle the step-size parameter.

7.1 Batches
Assume that we have an objective of the form

J(W) = Σ_{i=1}^{n} L(h(x(i); W), y(i)) ,

where h is the function computed by a neural network, and W stands for all the weight
matrices and vectors in the network.
When we perform batch gradient descent, we use the update rule

W := W − η∇W J(W) ,

which is equivalent to

W := W − η Σ_{i=1}^{n} ∇W L(h(x(i); W), y(i)) .

So, we sum up the gradient of loss at each training point, with respect to W, and then take
a step in the negative direction of the gradient.
In stochastic gradient descent, we repeatedly pick a point (x(i) , y(i) ) at random from the
data set, and execute a weight update on that point alone:

W := W − η∇W L(h(x(i) ; W), y(i) ) .

As long as we pick points uniformly at random from the data set, and decrease η at an
appropriate rate, we are guaranteed to converge to at least a local optimum.
These two methods have offsetting virtues. The batch method takes steps in the exact
gradient direction but requires a lot of computation before even a single step can be taken,
especially if the data set is large. The stochastic method begins moving right away, and can
sometimes make very good progress before looking at even a substantial fraction of the
whole data set, but if there is a lot of variability in the data, it might require a very small η
to effectively average over the individual steps moving in “competing” directions.
An effective strategy is to “average” between batch and stochastic gradient descent by
using mini-batches. For a mini-batch of size k, we select k distinct data points uniformly
at random from the data set and do the update based just on their contributions to the
gradient
W := W − η Σ_{i=1}^{k} ∇W L(h(x(i); W), y(i)) .

Most neural network software packages are set up to do mini-batches.


Study Question: For what value of k is mini-batch gradient descent equivalent to


stochastic gradient descent? To batch gradient descent?
Picking k unique data points at random from a large data-set is potentially computa-
tionally difficult. An alternative strategy, if you have an efficient procedure for randomly
shuffling the data set (or randomly shuffling a list of indices into the data set), is to operate
in a loop, roughly as follows:

M INI -B ATCH -SGD(NN, data, k)


1 n = length(data)
2 while not done:
3 R ANDOM -S HUFFLE(data)
4 for i = 1 to n/k
5 B ATCH -G RADIENT-U PDATE(NN, data[(i − 1)k : ik])

7.2 Adaptive step-size


Picking a value for η is difficult and time-consuming. If it’s too small, then convergence is
slow and if it’s too large, then we risk divergence or slow convergence due to oscillation.
This problem is even more pronounced in stochastic or mini-batch mode, because we know
we need to decrease the step size for the formal guarantees to hold.
It’s also true that, within a single neural network, we may well want to have differ-
ent step sizes. As our networks become deep (with increasing numbers of layers) we can
find that the magnitude of the gradient of the loss with respect to the weights in the last layer, ∂loss/∂WL, may be substantially different from the gradient of the loss with respect to the weights in the first layer, ∂loss/∂W1. If you look carefully at equation 8.3, you can see that
the output gradient is multiplied by all the weight matrices of the network and is “fed
back” through all the derivatives of all the activation functions. This can lead to a problem
of exploding or vanishing gradients, in which the back-propagated gradient is much too big
or small to be used in an update rule with the same step size.
So, we'll consider having an independent step-size parameter for each weight, and updating it based on a local view of how the gradient updates have been going. (This section is very strongly influenced by Sebastian Ruder's excellent blog posts on the topic: ruder.io/optimizing-gradient-descent)

7.2.1 Running averages
We'll start by looking at the notion of a running average. It's a computational strategy for estimating a possibly weighted average of a sequence of data. Let our data sequence be a_1, a_2, . . .; then we define a sequence of running average values, A_0, A_1, A_2, . . . using the equations
equations

    A_0 = 0
    A_t = γ_t A_{t−1} + (1 − γ_t) a_t

where γ_t ∈ (0, 1). If γ_t is a constant γ, then this is a moving average, in which

    A_T = γ A_{T−1} + (1 − γ) a_T
        = γ(γ A_{T−2} + (1 − γ) a_{T−1}) + (1 − γ) a_T
        = Σ_{t=1}^{T} γ^{T−t} (1 − γ) a_t

So, you can see that inputs a_t closer to the end of the sequence have more effect on A_T than
early inputs.


If, instead, we set γ_t = (t − 1)/t, then we get the actual average.


Study Question: Prove to yourself that the previous assertion holds.
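As a quick illustration (our own toy code, not part of the notes), the two choices of γ_t can be computed side by side:

    import numpy as np

    def running_average(a, gamma_fn):
        # A_t = gamma_t * A_{t-1} + (1 - gamma_t) * a_t, with A_0 = 0.
        A = 0.0
        out = []
        for t, a_t in enumerate(a, start=1):
            g = gamma_fn(t)
            A = g * A + (1 - g) * a_t
            out.append(A)
        return out

    data = np.random.randn(1000) + 5.0
    moving = running_average(data, lambda t: 0.9)          # constant gamma: moving average
    exact  = running_average(data, lambda t: (t - 1) / t)  # gamma_t = (t-1)/t: ordinary mean
    print(moving[-1], exact[-1], data.mean())              # exact[-1] equals data.mean()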

7.2.2 Momentum
Now, we can use methods that are a bit like running averages to describe strategies for
computing η. The simplest method is momentum, in which we try to “average” recent
gradient updates, so that if they have been bouncing back and forth in some direction, we
take out that component of the motion. For momentum, we have

    V_0 = 0
    V_t = γ V_{t−1} + η ∇_W J(W_{t−1})
    W_t = W_{t−1} − V_t

This doesn't quite look like an adaptive step size. But what we can see is that, if we let
η = η′(1 − γ), then the rule looks exactly like doing an update with step size η′ on a
moving average of the gradients with parameter γ:

    M_0 = 0
    M_t = γ M_{t−1} + (1 − γ) ∇_W J(W_{t−1})
    W_t = W_{t−1} − η′ M_t

We will find that V_t will be bigger in dimensions that consistently have the same sign for
the gradient and smaller for those that don't. Of course, we now have two parameters to set, but
the hope is that the algorithm will perform better overall, so it will be worth trying to find
good values for them. Often γ is set to be something like 0.9.

[Figure: The red arrows show the update after one step of mini-batch gradient descent with momentum; the blue points show the direction of the gradient with respect to the mini-batch at each step. Momentum smooths the path taken towards the local minimum and leads to faster convergence.]

Study Question: If you set γ = 0.1, would momentum have more of an effect or less
of an effect than if you set it to 0.9?
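Here is a minimal sketch of the momentum update for a generic parameter vector; grad is a stand-in for ∇_W J(W) (any function you supply), and all of the names are ours.

    import numpy as np

    def gd_with_momentum(grad, W0, eta=0.01, gamma=0.9, steps=1000):
        W = W0.copy()
        V = np.zeros_like(W)               # V_0 = 0
        for _ in range(steps):
            V = gamma * V + eta * grad(W)  # V_t = gamma V_{t-1} + eta grad J(W_{t-1})
            W = W - V                      # W_t = W_{t-1} - V_t
        return W

    # Usage on a simple quadratic J(W) = ||W||^2, whose gradient is 2W:
    W_min = gd_with_momentum(lambda W: 2 * W, np.array([5.0, -3.0]))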

7.2.3 Adadelta
Another useful idea is this: we would like to take larger steps in parts of the space where
J(W) is nearly flat (because there's no risk of taking too big a step due to the gradient
being large) and smaller steps when it is steep. We'll apply this idea to each weight independently, and end up with a method called adadelta, which is a variant on adagrad (for
adaptive gradient). Even though our weights are indexed by layer, input unit and output
unit, for simplicity here, just let W_j be any weight in the network (we will do the same
thing for all of them).
    g_{t,j} = ∇_W J(W_{t−1})_j
    G_{t,j} = γ G_{t−1,j} + (1 − γ) g_{t,j}^2
    W_{t,j} = W_{t−1,j} − (η / √(G_{t,j} + ε)) g_{t,j}
The sequence G_{t,j} is a moving average of the square of the jth component of the gradient.
We square it in order to be insensitive to the sign—we want to know whether the magnitude is big or small. Then, we perform a gradient update to weight j, but divide the step
size by √(G_{t,j} + ε), which is larger when the surface is steeper in direction j at point W_{t−1} in
weight space; this means that the step size will be smaller when it's steep and larger when
it's flat.
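A per-weight sketch of this update (our own code, assuming a moving-average parameter γ and a small ε to avoid division by zero, as above):

    import numpy as np

    def adadelta_like(grad, W0, eta=0.1, gamma=0.9, eps=1e-8, steps=1000):
        W = W0.copy()
        G = np.zeros_like(W)                    # moving average of squared gradients
        for _ in range(steps):
            g = grad(W)
            G = gamma * G + (1 - gamma) * g**2  # G_{t,j}
            W = W - eta / np.sqrt(G + eps) * g  # smaller steps where the surface is steep
        return W

    # Usage on the same quadratic as before, with gradient 2W:
    W_approx = adadelta_like(lambda W: 2 * W, np.array([5.0, -3.0]))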

7.2.4 Adam
Adam has become the default method of managing step sizes in neural networks. (Although, interestingly, it may actually violate the convergence conditions of SGD: arxiv.org/abs/1705.08292) It combines the ideas of momentum and adadelta. We start by writing moving averages of the gradient and squared gradient, which reflect estimates of the mean and variance of the gradient for weight j:

    g_{t,j} = ∇_W J(W_{t−1})_j
    m_{t,j} = B_1 m_{t−1,j} + (1 − B_1) g_{t,j}
    v_{t,j} = B_2 v_{t−1,j} + (1 − B_2) g_{t,j}^2 .
A problem with these estimates is that, if we initialize m_0 = v_0 = 0, they will always be
biased (slightly too small). So we will correct for that bias by defining

    m̂_{t,j} = m_{t,j} / (1 − B_1^t)
    v̂_{t,j} = v_{t,j} / (1 − B_2^t)
    W_{t,j} = W_{t−1,j} − (η / √(v̂_{t,j} + ε)) m̂_{t,j} .
Note that B_1^t is B_1 raised to the power t, and likewise for B_2^t. To justify these corrections,
note that if we were to expand m_{t,j} in terms of m_{0,j} and g_{0,j}, g_{1,j}, . . . , g_{t,j}, the coefficients
would sum to 1. However, the coefficient behind m_{0,j} is B_1^t and since m_{0,j} = 0, the sum of
coefficients of the non-zero terms is 1 − B_1^t, hence the correction. The same justification holds
for v_{t,j}.
Now, our update for weight j has a step size that takes the steepness into account, as in
adadelta, but also tends to move in the same direction, as in momentum. The authors of
this method propose setting B_1 = 0.9, B_2 = 0.999, ε = 10^{−8}. Although we now have even
more parameters, Adam is not highly sensitive to their values (small changes do not have
a huge effect on the result).
Study Question: Define m̂j directly as a moving average of gt,j . What is the decay?
Even though we now have a step size for each weight, and we have to update various quantities on each iteration of gradient descent, it's relatively easy to implement by
maintaining a matrix for each quantity (m_t^ℓ, v_t^ℓ, g_t^ℓ, (g_t^ℓ)²) in each layer ℓ of the network.
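Putting the pieces together, here is a compact sketch of the Adam update for a single weight vector (our own illustrative code, using the constants suggested above; grad is any gradient function you supply):

    import numpy as np

    def adam(grad, W0, eta=0.001, B1=0.9, B2=0.999, eps=1e-8, steps=5000):
        W = W0.copy()
        m = np.zeros_like(W)                  # moving average of gradients
        v = np.zeros_like(W)                  # moving average of squared gradients
        for t in range(1, steps + 1):
            g = grad(W)
            m = B1 * m + (1 - B1) * g
            v = B2 * v + (1 - B2) * g**2
            m_hat = m / (1 - B1**t)           # bias corrections
            v_hat = v / (1 - B2**t)
            W = W - eta / np.sqrt(v_hat + eps) * m_hat
        return W

    W_approx = adam(lambda W: 2 * W, np.array([5.0, -3.0]))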


8 Regularization
So far, we have only considered optimizing loss on the training data as our objective for
neural network training. But, as we have discussed before, there is a risk of overfitting if
we do this. The pragmatic fact is that, in current deep neural networks, which tend to be
very large and to be trained with a large amount of data, overfitting is not a huge problem.
This runs counter to our current theoretical understanding and the study of this question
is a hot area of research. Nonetheless, there are several strategies for regularizing a neural
network, and they can sometimes be important.

8.1 Methods related to ridge regression


One group of strategies can, interestingly, be shown to have similar effects: early stopping,
weight decay, and adding noise to the training data. (This result is due to Bishop, described in his textbook and here: doi.org/10.1162/neco.1995.7.1.108.)
Early stopping is the easiest to implement and is in fairly common use. The idea is
to train on your training set, but at every epoch (pass through the whole training set, or
possibly more frequently), evaluate the loss of the current W on a validation set. It will
generally be the case that the loss on the training set goes down fairly consistently with
each iteration, while the loss on the validation set will initially decrease but then begin to increase
again. Once you see that the validation loss is systematically increasing, you can stop
training and return the weights that had the lowest validation error.
Another common strategy is to simply penalize the norm of all the weights, as we did in
ridge regression. This method is known as weight decay, because when we take the gradient
of the objective

    J(W) = Σ_{i=1}^{n} Loss(NN(x^{(i)}), y^{(i)}; W) + λ‖W‖²

we end up with an update of the form

    W_t = W_{t−1} − η ( ∇_W Loss(NN(x^{(i)}), y^{(i)}; W_{t−1}) + λ W_{t−1} )
        = W_{t−1}(1 − λη) − η ∇_W Loss(NN(x^{(i)}), y^{(i)}; W_{t−1}) .

This rule has the form of first "decaying" W_{t−1} by a factor of (1 − λη) and then taking a
gradient step.
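As a two-line sketch of this update (our own code; W, eta, lam, and grad_loss are assumed to be supplied by the caller, with grad_loss computing the data-loss gradient for the current point or mini-batch):

    # One weight-decay step: decay W, then take an ordinary gradient step.
    def weight_decay_step(W, grad_loss, eta, lam):
        return W * (1 - lam * eta) - eta * grad_loss(W)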
Finally, the same effect can be achieved by perturbing the x(i) values of the training data
by adding a small amount of zero-mean normally distributed noise before each gradient
computation. It makes intuitive sense that it would be more difficult for the network to
overfit to particular training data if they are changed slightly on each training step.

8.2 Dropout
Dropout is a regularization method that was designed to work with deep neural networks.
The idea behind it is, rather than perturbing the data every time we train, we’ll perturb the
network! We’ll do this by randomly, on each training step, selecting a set of units in each
layer and prohibiting them from participating. Thus, all of the units will have to take a
kind of “collective” responsibility for getting the answer right, and will not be able to rely
on any small subset of the weights to do all the necessary computation. This tends also to
make the network more robust to data perturbations.
During the training phase, for each training example, for each unit, randomly with
probability p temporarily set a_j^ℓ := 0. There will be no contribution to the output and no
gradient update for the associated unit.


Study Question: Be sure you understand why, when using SGD, setting an activa-
tion value to 0 will cause that unit’s weights not to be updated on that iteration.
When we are done training and want to use the network to make predictions, we multiply all the weights by (1 − p), the probability that a unit was kept, to achieve the same average activation levels.
Implementing dropout is easy! In the forward pass during training, we let

    a^ℓ = f(z^ℓ) ∗ d^ℓ

where ∗ denotes component-wise product and d^ℓ is a vector whose entries are drawn independently at random to be 0 with probability p and 1 otherwise. The backwards pass depends on a^ℓ, so we do not need to make any
further changes to the algorithm.
It is common to set p to 0.5, but this is something one might experiment with to get
good results on your problem and data.
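A minimal sketch of the training-time forward pass with dropout (our own code, assuming f is the layer's activation function, taken to be a ReLU here just as an example, and p is the drop probability; many real frameworks instead rescale during training, which is usually called "inverted dropout"):

    import numpy as np

    def dropout_forward(z, p):
        # Training-time forward pass: a^l = f(z^l) * d^l, where each entry of d^l
        # is 0 with probability p (the unit is "dropped") and 1 otherwise.
        d = (np.random.rand(*z.shape) >= p).astype(z.dtype)
        return np.maximum(z, 0) * d

    a = dropout_forward(np.random.randn(5), p=0.5)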

8.3 Batch Normalization


A slightly more modern alternative to dropout, which achieves somewhat better performance, is batch normalization. (For more details see arxiv.org/abs/1502.03167.) It was originally developed to address a problem of covariate
shift: that is, if you consider the second layer of a two-layer neural network, the distribution
of its input values is changing over time as the first layer's weights change. Learning when
the input distribution is changing is extra difficult: you have to change your weights to improve your predictions, but also just to compensate for a change in your inputs (imagine,
for instance, that the magnitude of the inputs to your layer is increasing over time—then
your weights will have to decrease, just to keep your predictions the same).
So, when training with mini-batches, the idea is to standardize the input values for each
mini-batch, just in the way that we did it in chapter 4, subtracting off the mean
and dividing by the standard deviation of each input dimension. This means that the scale
of the inputs to each layer remains the same, no matter how the weights in previous layers
change. However, this somewhat complicates matters, because the computation of the
weight updates will need to take into account that we are performing this transformation.
In the modular view, batch normalization can be seen as a module that is applied to a^l,
interposed after the output of f^l and before the product with W^{l+1}.
Batch normalization ends up having a regularizing effect for similar reasons that adding
noise and dropout do: each mini-batch of data ends up being mildly perturbed, which
prevents the network from exploiting very particular values of the data points.
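A sketch of the standardization step applied to one mini-batch of activations (our own code; shape is batch size × units). This is only the normalization itself, without the learned scale-and-shift parameters that the full batch normalization module also includes.

    import numpy as np

    def batch_normalize(A, eps=1e-5):
        # Standardize each unit (column) over the mini-batch (rows):
        # subtract the batch mean and divide by the batch standard deviation.
        mu = A.mean(axis=0, keepdims=True)
        sigma = A.std(axis=0, keepdims=True)
        return (A - mu) / (sigma + eps)

    A = np.random.randn(32, 10) * 3.0 + 7.0   # a mini-batch of 32 activations of 10 units
    A_norm = batch_normalize(A)               # each column now has mean ~0, std ~1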

CHAPTER 9

Convolutional Neural Networks

So far, we have studied what are called fully connected neural networks, in which all of the
units at one layer are connected to all of the units in the next layer. This is a good arrange-
ment when we don’t know anything about what kind of mapping from inputs to outputs
we will be asking the network to learn to approximate. But if we do know something about
our problem, it is better to build it into the structure of our neural network. Doing so can
save computation time and significantly diminish the amount of training data required to
arrive at a solution that generalizes robustly.
One very important application domain of neural networks, where the methods have
achieved an enormous amount of success in recent years, is signal processing. Signals
might be spatial (in two-dimensional camera images or three-dimensional depth or CAT
scans) or temporal (speech or music). If we know that we are addressing a signal-processing
problem, we can take advantage of invariant properties of that problem. In this chapter, we
will focus on two-dimensional spatial problems (images) but use one-dimensional ones as
a simple example. Later, we will address temporal problems.
Imagine that you are given the problem of designing and training a neural network that
takes an image as input, and outputs a classification, which is positive if the image contains
a cat and negative if it does not. An image is described as a two-dimensional array of pixels
(a pixel is a "picture element"), each of which may be represented by three integer values, encoding intensity levels in red,
green, and blue color channels.
There are two important pieces of prior structural knowledge we can bring to bear on
this problem:

• Spatial locality: The set of pixels we will have to take into consideration to find a cat will be near one another in the image. (So, for example, we won't have to consider some combination of pixels in the four corners of the image, in case they encode cat-ness.)

• Translation invariance: The pattern of pixels that characterizes a cat is the same no matter where in the image the cat occurs. (Cats don't look different if they're on the left or the right side of the image.)

We will design neural network structures that take advantage of these properties.


1 Filters
We begin by discussing image filters. An image filter is a function that takes in a local spatial neighborhood of pixel values and detects the presence of some pattern in that data. (Unfortunately, in AI/ML/CS/Math the word "filter" gets used in many ways: in addition to the one we describe here, it can describe a temporal process (in fact, our moving averages are a kind of filter) and even a somewhat esoteric algebraic structure.)
Let's consider a very simple case to start, in which we have a 1-dimensional binary "image" and a filter F of size two. The filter is a vector of two numbers, which we will move along the image, taking the dot product between the filter values and the image values at each step, and aggregating the outputs to produce a new image.
Let X be the original image, of size d; then the output image is specified by

    Y_i = F^T (X_{i−1}, X_i)^T .

To ensure that the output image is also of dimension d, we will generally "pad" the input image with 0 values if we need to access pixels that are beyond the bounds of the input image. This process of applying the filter to the image to create a new image is called "convolution." (Filters are also sometimes called convolutional kernels.)
Here is a concrete example. Let the filter F_1 = (−1, +1). Then, given the first image below, we can convolve it with filter F_1 to obtain the second image. You can think of this filter as a detector for "left edges" in the original image—to see this, look at the places where there is a 1 in the output image, and see what pattern exists at that position in the input image. Another interesting filter is F_2 = (−1, +1, −1). The third image below shows the result of convolving the first image with F_2.
Study Question: Convince yourself that this filter can be understood as a detector
for isolated positive pixels in the binary image.

Image:                       0  0  1  1  1  0  1  0  0  0

F_1:                        -1 +1

After convolution (w/ F_1):     0  1  0  0 -1  1 -1  0  0

F_2:                        -1 +1 -1

After convolution (w/ F_2):    -1  0 -1  0 -2  1 -1  0
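A small numpy sketch that reproduces this example (our own code). With zero padding the output has the same length as the input; dropping the padded border positions recovers the rows shown above.

    import numpy as np

    def filter1d(X, F, offset):
        # Y_i = sum_j F[j] * X[i + offset + j], with zero padding outside the image.
        # offset = -1 reproduces both examples in the text: for F of size 2 the window
        # is (X_{i-1}, X_i); for size 3 it is (X_{i-1}, X_i, X_{i+1}).
        X = np.asarray(X, dtype=float)
        k = len(F)
        Xp = np.concatenate([np.zeros(k), X, np.zeros(k)])   # generous zero padding
        Y = np.zeros(len(X))
        for i in range(len(X)):
            start = i + k + offset                           # index of X_{i+offset} in Xp
            Y[i] = np.dot(F, Xp[start:start + k])
        return Y

    X  = [0, 0, 1, 1, 1, 0, 1, 0, 0, 0]
    F1 = [-1, +1]
    F2 = [-1, +1, -1]
    print(filter1d(X, F1, offset=-1))   # [0, 0, 1, 0, 0, -1, 1, -1, 0, 0]
    print(filter1d(X, F2, offset=-1))   # [0, -1, 0, -1, 0, -2, 1, -1, 0, 0]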


Two-dimensional versions of filters like these are thought to be found in the visual
cortex of all mammalian brains. Similar patterns arise from statistical analysis of natural
images. Computer vision people used to spend a lot of time hand-designing filter banks. A
filter bank is a set of sets of filters, arranged as shown in the diagram below.


[Figure: a filter bank, i.e., a set of sets of filters, applied first to the image and then to the resulting stacks of channels.]

All of the filters in the first group are applied to the original image; if there are k such
filters, then the result is k new images, which are called channels. Now imagine stacking
all these new images up so that we have a cube of data, indexed by the original row and
column indices of the image, as well as by the channel. The next set of filters in the filter
bank will generally be three-dimensional: each one will be applied to a sub-range of the row
and column indices of the image and to all of the channels.
These 3D chunks of data are called tensors. The algebra of tensors is fun, and a lot like matrix algebra, but we won't go into it in any detail. (We will use a popular piece of neural-network software called Tensorflow because it makes operations on tensors easy.)
Here is a more complex example of two-dimensional filtering. We have two 3 × 3 filters in the first layer, f_1 and f_2. You can think of each one as "looking" for three pixels in a row, f_1 vertically and f_2 horizontally. Assuming our input image is n × n, the result of filtering with these two filters is an n × n × 2 tensor. Now we apply a tensor filter (hard to draw!) that "looks for" a combination of two horizontal and two vertical bars (now represented by individual pixels in the two channels), resulting in a single final n × n image. (When we have a color image as input, we treat it as having 3 channels, and hence as an n × n × 3 tensor.)
[Figure: filters f_1 and f_2 applied to the image produce a two-channel tensor, which a tensor filter then combines into a single output image.]

We are going to design neural networks that have this structure. Each “bank” of the
filter bank will correspond to a neural-network layer. The numbers in the individual filters
will be the “weights” of the network, which we will train using gradient descent. What
makes this interesting and powerful (and somewhat confusing at first) is that the same
weights are used many many times in the computation of each layer. This weight sharing
means that we can express a transformation on a large image with relatively few parame-
ters; it also means we’ll have to take care in figuring out exactly how to train it!
We will define a filter layer l formally with: (For simplicity, we are assuming that all images and filters are square, having the same number of rows and columns. That is in no way necessary, but is usually fine and definitely simplifies our notation.)

• number of filters m^l;

• size of filters k^l × k^l × m^{l−1};

• stride s^l, the spacing at which we apply the filter to the image; in all of our examples so far, we have used a stride of 1, but if we were to "skip" and apply the filter only at odd-numbered indices of the image, then it would have a stride of two (and produce a resulting image of half the size);

• input tensor size n^{l−1} × n^{l−1} × m^{l−1}.

This layer will produce an output tensor of size n^l × n^l × m^l, where n^l = ⌊n^{l−1}/s^l⌋. The
weights are the values defining the filter: there will be m^l different k^l × k^l × m^{l−1} tensors
of weight values.
This may seem complicated, but we get a rich class of mappings that exploit image
structure and have many fewer weights than a fully connected layer would.
Study Question: How many weights are in a convolutional layer specified as
above?

Study Question: If we used a fully-connected layer with the same size inputs and
outputs, how many weights would it have?

2 Max Pooling
It is typical to structure filter banks into a pyramid (both in engineering and in nature), in which the image sizes get smaller in successive layers of processing. The idea is that we find local patterns, like bits of edges, in the early layers, and then look for patterns in those patterns, etc. This means that, effectively, we are looking for patterns in larger pieces of the image as we apply successive filters. Having a stride greater than one makes the images smaller, but does not necessarily aggregate information over that spatial range.
Another common layer type, which accomplishes this aggregation, is max pooling. A max pooling layer operates like a filter, but has no weights. You can think of it as a pure functional layer, like a ReLU layer in a fully connected network. It has a filter size, as in a filter layer, but simply returns the maximum value in its field. (We sometimes use the term receptive field or just field to mean the area of an input image that a filter is being applied to.) Usually, we apply max pooling with the following traits:

• stride > 1, so that the resulting image is smaller than the input image; and

• k > stride, so that the whole image is covered.

As a result of applying a max pooling layer, we don’t keep track of the precise location of a
pattern. This helps our filters to learn to recognize patterns independent of their location.
Consider a max pooling layer of stride = k = 2. This would map a 64 × 64 × 3 image to
a 32 × 32 × 3 image.
Study Question: Maximilian Poole thinks it would be a good idea to add two max
pooling layers of size k to their network. What single layer would be equivalent?
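A sketch of max pooling on a square image with stride equal to the pool size, the common case described above (our own code; the non-overlapping assumption and the names are ours):

    import numpy as np

    def max_pool(image, k):
        # Non-overlapping k x k max pooling (stride = k) over each channel of an
        # n x n x c image; assumes n is divisible by k.
        n, _, c = image.shape
        out = image.reshape(n // k, k, n // k, k, c)
        return out.max(axis=(1, 3))

    image = np.random.randn(64, 64, 3)
    pooled = max_pool(image, 2)        # shape (32, 32, 3), as in the example above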

3 Typical architecture
Here is the form of a typical convolutional network:


[Figure: a typical convolutional network architecture. Source: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html]

After each filter layer there is generally a ReLU layer; there may be multiple filter/ReLU layers, then a max pooling layer, then some more filter/ReLU layers, then max
pooling. Once the output is down to a relatively small size, there is typically a last fully-connected layer, leading into an activation function such as softmax that produces the final
output. The exact design of these structures is an art—there is not currently any clear theoretical (or even systematic empirical) understanding of how these various design choices
affect overall performance of the network.
The critical point for us is that this is all just a big neural network, which takes an input
and computes an output. The mapping is a differentiable function of the weights (well, the derivative is not continuous, both because of the ReLU and the max pooling operations, but we ignore that fact), which
means we can adjust the weights to decrease the loss by performing gradient descent, and
we can compute the relevant gradients using back-propagation!
Let's work through a very simple example of how back-propagation can work on a convolutional network. The architecture is shown below. Assume we have a one-dimensional
single-channel image of size n × 1 × 1 and a single k × 1 × 1 filter in the first convolutional
layer. Then we pass it through a ReLU layer and a fully-connected layer with no additional
activation function on the output.

[Figure: the input X = A^0 is padded with 0's (to get an output of the same shape) and convolved with W^1 to produce Z^1; a ReLU produces A^1; a fully-connected layer produces Z^2 = A^2.]

For simplicity assume k is odd, let the input image X = A0 , and assume we are using


squared loss. Then we can describe the forward pass as follows:


    Z^1_i = W^{1T} · A^0_{[i−⌊k/2⌋ : i+⌊k/2⌋]}
    A^1 = ReLU(Z^1)
    A^2 = W^{2T} A^1
    L(A^2, y) = (A^2 − y)²

Study Question: For a filter of size k, how much padding do we need to add to the
top and bottom of the image?
How do we update the weights in filter W^1?

    ∂loss/∂W^1 = ∂Z^1/∂W^1 · ∂A^1/∂Z^1 · ∂loss/∂A^1

• ∂Z^1/∂W^1 is the k × n matrix such that ∂Z^1_i/∂W^1_j = X_{i−⌊k/2⌋+j−1}. So, for example, if i = 10 (which corresponds to column 10 in this matrix, illustrating the dependence of pixel 10 of the output image on the weights) and k = 5, then the elements in column 10 will be X_8, X_9, X_10, X_11, X_12.

• ∂A^1/∂Z^1 is the n × n diagonal matrix whose ith diagonal entry is ∂A^1_i/∂Z^1_i = 1 if Z^1_i > 0, and 0 otherwise.

• ∂loss/∂A^1 = ∂loss/∂A^2 · ∂A^2/∂A^1 = 2(A^2 − y) W^2, an n × 1 vector.

Multiplying these components yields the desired gradient, of shape k × 1.

CHAPTER 10

Sequential models

So far, we have limited our attention to domains in which each output y is assumed to
have been generated as a function of an associated input x, and our hypotheses have been
“pure” functions, in which the output depends only on the input (and the parameters we
have learned that govern the function’s behavior). In the next few weeks, we are going to
consider cases in which our models need to go beyond functions.
• In recurrent neural networks, the hypothesis that we learn is not a function of a single
input, but of the whole sequence of inputs that the predictor has received.
• In reinforcement learning, the hypothesis is either a model of a domain (such as a game)
as a recurrent system or a policy which is a pure function, but whose loss is deter-
mined by the ways in which the policy interacts with the domain over time.
Before we engage with those forms of learning, we will study models of sequential or
recurrent systems that underlie the learning methods.

1 State machines
A state machine is a description of a process (computational, physical, economic) in terms of its potential sequences of states. (This is such a pervasive idea that it has been given many names in many subareas of computer science, control theory, physics, etc., including: automaton, transducer, dynamical system, system, etc.)
The state of a system is defined to be all you would need to know about the system to predict its future trajectories as well as possible. It could be the position and velocity of an object, or the locations of your pieces on a board game, or the current traffic densities on a highway network. (There are a huge number of major and minor variations on the idea of a state machine. We'll just work with one specific one in this section and another one in the next, but don't worry if you see other variations out in the world!)
Formally, we define a state machine as (S, X, Y, s_0, f, g) where:

• S is a finite or infinite set of possible states;

• X is a finite or infinite set of possible inputs;

• Y is a finite or infinite set of possible outputs;

• s_0 ∈ S is the initial state of the machine;

• f : S × X → S is a transition function, which takes an input and a previous state and produces a next state;


• g : S → Y is an output function, which takes a state and produces an output.

The basic operation of the state machine is to start with state s_0 (in some cases, we will pick a starting state), then iteratively compute:

    s_t = f(s_{t−1}, x_t)
    y_t = g(s_t)

The diagram below illustrates this process. Note that the "feedback" connection of s_t back into f has to be buffered or delayed by one time step—otherwise what it is computing would not generally be well defined.

[Figure: the input x_t and the delayed state s_{t−1} feed into f, producing s_t; g maps s_t to the output y_t.]

So, given a sequence of inputs x_1, x_2, . . ., the machine generates a sequence of outputs

    y_1 = g(f(x_1, s_0)),   y_2 = g(f(x_2, f(x_1, s_0))),   . . .

We sometimes say that the machine transduces sequence x into sequence y. The output at time t can have dependence on inputs from steps 1 to t.
One common form is finite state machines, in which S, X, Y are all finite sets. They are often described using state transition diagrams such as the one below, in which nodes stand for states and arcs indicate transitions. Nodes are labeled by which output they generate and arcs are labeled by which input causes the transition. (All computers can be described, at the digital level, as finite state machines. Big, but finite!)
One can verify that the state machine below reads binary strings and determines the parity of the number of zeros in the given string. Check for yourself that all inputted binary strings end in state S1 if and only if they contain an even number of zeros.
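As a sketch of these definitions (our own code, not part of the notes), here is a tiny implementation of the transduction loop, instantiated with a two-state machine that tracks the parity of the number of zeros seen so far.

    def transduce(f, g, s0, xs):
        # Run the state machine: s_t = f(s_{t-1}, x_t), y_t = g(s_t).
        s, ys = s0, []
        for x in xs:
            s = f(s, x)
            ys.append(g(s))
        return ys

    # Parity-of-zeros machine: state 'S1' = even number of zeros so far, 'S2' = odd.
    f = lambda s, x: s if x == 1 else ('S2' if s == 'S1' else 'S1')
    g = lambda s: s
    print(transduce(f, g, 'S1', [0, 1, 0, 0]))   # ['S2', 'S2', 'S1', 'S2']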

Another common structure that is simple but powerful and widely used in signal processing
and control is the linear time-invariant (LTI) system. In this case, S = ℝ^m, X = ℝ^ℓ, and Y = ℝ^n,
and f and g are linear functions of their inputs. In discrete time, they can be defined by a
linear difference equation, like

y[t] = 3y[t − 1] + 6y[t − 2] + 5x[t] + 3x[t − 2] ,


and can be implemented using state to store relevant previous input and output informa-
tion.
We will study recurrent neural networks which are a lot like a non-linear version of an
LTI system, with transition and output functions
    f(s, x) = f_1(W^{sx} x + W^{ss} s + W^{ss}_0)
    g(s) = f_2(W^O s + W^O_0)

defined by weight matrices

    W^{sx} : m × ℓ
    W^{ss} : m × m
    W^{ss}_0 : m × 1
    W^O : n × m
    W^O_0 : n × 1

and activation functions f1 and f2 . We will see that it’s actually possible to learn weight
values using gradient descent.

2 Markov decision processes


A Markov decision process (MDP) is a variation on a state machine in which:

• The transition function is stochastic, meaning that it defines a probability distribution over the next state given the previous state and input, but each time it is evaluated it draws a new state from that distribution. (Recall that stochastic is another word for probabilistic; we don't say "random" because that can be interpreted in two ways, both of which are incorrect: we don't pick the transition function itself at random from a distribution, and the transition function doesn't pick its output uniformly at random.)

• The output is equal to the state (that is, g is the identity function). (There is an interesting variation on MDPs, called a partially observable MDP, in which the output is also drawn from a distribution depending on the state.)

• Some states (or state-action pairs) are more desirable than others.

An MDP can be used to model interaction with an outside "world," such as a single-player game. (There is an interesting, direct extension to two-player zero-sum games, such as Chess and Go.)
We will focus on the case in which S and X are finite, and will call the input set A for actions (rather than X). The idea is that an agent (a robot or a game-player) can model its environment as an MDP and try to choose actions that will drive the process into states that have high scores.
Formally, an MDP is ⟨S, A, T, R, γ⟩ where:

• T : S × A × S → ℝ is a transition model, where T(s, a, s′) = P(S_t = s′ | S_{t−1} = s, A_{t−1} = a), specifying a conditional probability distribution;

• R : S × A → ℝ is a reward function, where R(s, a) specifies how desirable it is to be in state s and take action a; and

• γ ∈ [0, 1] is a discount factor.

A policy is a function π : S → A that specifies what action to take in each state.

2.1 Finite-horizon solutions

Given an MDP, our goal is typically to find a policy that is optimal in the sense that it gets as much total reward as possible, in expectation over the stochastic transitions that the domain makes. In this section, we will consider the case where there is a finite horizon H, indicating the total number of steps of interaction that the agent will have with the MDP. (The notation here uses capital letters, like S, to stand for random variables and small letters to stand for concrete values. So S_t here is a random variable that can take on elements of S as values.)

2.1.1 Evaluating a given policy


Before we can talk about how to find a good policy, we have to specify a measure of the
goodness of a policy. We will do so by defining the horizon-h value of a state given a policy,
V_π^h(s), for an MDP. We do this by induction on the horizon, which is the number of steps left
to go.
The base case is when there are no steps remaining, in which case, no matter what state
we're in, the value is 0, so

    V_π^0(s) = 0 .
Then, the value of a policy in state s at horizon h + 1 is equal to the reward it will get in
state s plus the expected horizon h value of the next state. So, starting with horizons 1 and
2, and then moving to the general case, we have:

    V_π^1(s) = R(s, π(s)) + 0
    V_π^2(s) = R(s, π(s)) + Σ_{s′} T(s, π(s), s′) · R(s′, π(s′))
    ...
    V_π^h(s) = R(s, π(s)) + Σ_{s′} T(s, π(s), s′) · V_π^{h−1}(s′)

The sum over s′ is an expected value: it considers all possible next states s′, and computes an average of their (h − 1)-horizon values, weighted by the probability that the transition function from state s with the action chosen by the policy, π(s), assigns to arriving in state s′.

Study Question: What is Σ_{s′} T(s, a, s′) for any particular s and a?
Then we can say that one policy π_1 is better than another policy π_2 for horizon h, written π_1 >_h π_2, if and only if for all s ∈ S, V_{π_1}^h(s) ≥ V_{π_2}^h(s), and there exists at least one s ∈ S such that V_{π_1}^h(s) > V_{π_2}^h(s).

2.1.2 Finding an optimal policy


How can we go about finding an optimal policy for an MDP? We could imagine enumerat-
ing all possible policies and calculating their value functions as in the previous section and
picking the best one...but that’s too much work!
The first observation to make is that, in a finite-horizon problem, the best action to take
depends on the current state, but also on the horizon: imagine that you are in a situation
where you could reach a state with reward 5 in one step or a state with reward 10 in two
steps. If you have at least two steps to go, then you’d move toward the reward-10 state,
but if you only have step left to go, you should go in the direction that will allow you to
gain 5!.
One way to find an optimal policy is to compute an optimal action-value function, Q. We
define Q^h(s, a) to be the expected value of

• starting in state s

• executing action a

• continuing for h − 1 more steps executing an optimal policy for the appropriate hori-
zon on each step.


Similar to our definition of V for evaluating a policy, we define the Q function recursively
according to the horizon. The only difference is that, on each step with horizon h, rather
than selecting an action specified by a given policy, we select the value of a that will maxi-
mize the expected Qh value of the next state.

    Q^0(s, a) = 0
    Q^1(s, a) = R(s, a) + 0
    Q^2(s, a) = R(s, a) + Σ_{s′} T(s, a, s′) max_{a′} R(s′, a′)
    ...
    Q^h(s, a) = R(s, a) + Σ_{s′} T(s, a, s′) max_{a′} Q^{h−1}(s′, a′)

We can solve for the values of Q with a simple recursive algorithm called value iteration, which just computes Q^h starting from horizon 0 and working backward to the desired horizon H. Given Q, an optimal policy is easy to find:

    π*_h(s) = arg max_a Q^h(s, a) .

There may be multiple possible optimal policies.

2.2 Infinite-horizon solutions


It is actually more typical to work in a regime where the actual finite horizon is not known.
This is called the infinite horizon version of the problem, when you don’t know when the
game will be over! However, if we tried to simply take our definition of Qh above and set
h = ∞, we would be in trouble, because it could well be that the Q∞ values for all actions
would be infinite, and there would be no way to select one over the other.
There are two standard ways to deal with this problem. One is to take a kind of average
over all time steps, but this can be a little bit tricky to think about. We’ll take a different
approach, which is to consider the discounted infinite horizon. We select a discount fac-
tor 0 < γ < 1. Instead of trying to find a policy that maximizes expected finite-horizon
undiscounted value,
    E[ Σ_{t=0}^{h} R_t | π, s_0 ]

we will try to find one that maximizes expected infinite-horizon discounted value, which is

    E[ Σ_{t=0}^{∞} γ^t R_t | π, s_0 ] = E[ R_0 + γR_1 + γ²R_2 + · · · | π, s_0 ] .

Note that the t indices here are not the number of steps to go, but actually the number
of steps forward from the starting state (there is no sensible notion of “steps to go” in the
infinite horizon case).
There are two good intuitive motivations for discounting. One is related to economic
theory and the present value of money: you’d generally rather have some money today
than that same amount of money next week (because you could use it now or invest it).
The other is to think of the whole process terminating, with probability 1 − γ on each step
of the interaction. This value is the expected amount of reward the agent would gain under
this terminating model.


2.2.1 Evaluating a policy


We will start, again, by evaluating a policy, but now in terms of the expected discounted
infinite-horizon value that the agent will get in the MDP if it executes that policy. We define
the value of a state s under policy π as
    V_π(s) = E[R_0 + γR_1 + γ²R_2 + · · · | π, S_0 = s] = E[R_0 + γ(R_1 + γ(R_2 + γ · · · )) | π, S_0 = s] .

Because the expectation of a linear combination of random variables is the linear combination of the expectations, we have

    V_π(s) = E[R_0 | π, S_0 = s] + γ E[R_1 + γ(R_2 + γ · · · ) | π, S_0 = s]
           = R(s, π(s)) + γ Σ_{s′} T(s, π(s), s′) V_π(s′) .

You could write down one of these equations for each of the n = |S| states. There are n unknowns V_π(s). These are linear equations, and so it's easy to solve them using Gaussian elimination to find the value of each state under this policy.
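In matrix form, these equations read V = R_π + γ T_π V, so V = (I − γ T_π)^{−1} R_π. A minimal numpy sketch of this linear solve (our own code, where T_pi[s, s′] = T(s, π(s), s′) and R_pi[s] = R(s, π(s))):

    import numpy as np

    def evaluate_policy(T_pi, R_pi, gamma):
        # Solve V = R_pi + gamma * T_pi V exactly, i.e. (I - gamma T_pi) V = R_pi.
        n = len(R_pi)
        return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

    # Usage with a made-up 2-state problem:
    T_pi = np.array([[0.9, 0.1],
                     [0.5, 0.5]])
    R_pi = np.array([1.0, 0.0])
    V = evaluate_policy(T_pi, R_pi, gamma=0.9)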
2.2.2 Finding an optimal policy

The best way of behaving in an infinite-horizon discounted MDP is not time-dependent: at every step, your expected future lifetime, given that you have survived until now, is 1/(1 − γ). (This is so cool! In a discounted model, if you find that you survived this round and landed in some state s′, then you have the same expected future lifetime as you did before. So the value function that is relevant in that state is exactly the same one as in state s.)
An important theorem about MDPs is: there exists a stationary optimal policy π* (there may be more than one) such that for all s ∈ S and all other policies π we have

    V_{π*}(s) ≥ V_π(s) .

(Stationary means that it doesn't change over time; the optimal policy in a finite-horizon MDP is non-stationary.)
There are many methods for finding an optimal policy for an MDP. We will study a very
popular and useful method called value iteration. It is also important to us, because it is the
basis of many reinforcement-learning methods.
Define Q∗ (s, a) to be the expected infinite horizon discounted value of being in state s,
executing action a, and executing an optimal policy π∗ thereafter. Using similar reasoning
to the recursive definition of Vπ , we can define
    Q*(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) max_{a′} Q*(s′, a′) .

This is also a set of equations, one for each (s, a) pair. This time, though, they are not
linear, and so they are not easy to solve. But there is a theorem that says they have a
unique solution!
If we knew the optimal value function, then we could derive the optimal policy π∗ as
    π*(s) = arg max_a Q*(s, a) .

We can iteratively solve for the Q values with the value iteration algorithm, shown
below:

VALUE-ITERATION(S, A, T, R, γ, ε)
1   for s ∈ S, a ∈ A:
2       Q_old(s, a) = 0
3   while True:
4       for s ∈ S, a ∈ A:
5           Q_new(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) max_{a′} Q_old(s′, a′)
6       if max_{s,a} |Q_old(s, a) − Q_new(s, a)| < ε:
7           return Q_new
8       Q_old := Q_new
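A direct numpy transcription of this pseudocode (our own sketch), with T stored as an array indexed by [s, a, s′] and R indexed by [s, a]:

    import numpy as np

    def value_iteration(T, R, gamma, eps):
        # T has shape (|S|, |A|, |S|); R has shape (|S|, |A|).
        Q_old = np.zeros_like(R)
        while True:
            # Q_new(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * max_a' Q_old(s', a')
            Q_new = R + gamma * T @ Q_old.max(axis=1)
            if np.max(np.abs(Q_old - Q_new)) < eps:
                return Q_new
            Q_old = Q_new

    # Example on a random 5-state, 2-action MDP:
    rng = np.random.default_rng(0)
    T = rng.random((5, 2, 5)); T /= T.sum(axis=2, keepdims=True)
    R = rng.random((5, 2))
    Q = value_iteration(T, R, gamma=0.9, eps=1e-6)
    pi = Q.argmax(axis=1)       # an optimal policy, as in the equation above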


2.2.3 Theory
There are a lot of nice theoretical results about value iteration. For some given (not neces-
sarily optimal) Q function, define πQ (s) = arg maxa Q(s, a).

(This is new notation! Given two functions f and f′, we write ‖f − f′‖_max to mean max_x |f(x) − f′(x)|. It measures the maximum absolute point of disagreement between the two functions.)

• After executing value iteration with parameter ε, ‖V_{π_Qnew} − V_{π*}‖_max < ε.

• There is a value of ε such that

    ‖Q_old − Q_new‖_max < ε  ⟹  π_Qnew = π* .

• As the algorithm executes, ‖V_{π_Qnew} − V_{π*}‖_max decreases monotonically on each iteration.

• The algorithm can be executed asynchronously, in parallel: as long as all (s, a) pairs are updated infinitely often in an infinite run, it still converges to the optimal value. (This is very important for reinforcement learning.)

CHAPTER 11

Reinforcement learning

So far, all the learning problems we have looked at have been supervised: that is, for each
training input x(i) , we are told which value y(i) should be the output. A very different
problem setting is reinforcement learning, in which the learning system is not directly told
which outputs go with which inputs. Instead, there is an interaction of the form:
• Learner observes input s(i)
• Learner generates output a(i)
• Learner observes reward r(i)
• Learner observes input s(i+1)
• Learner generates output a(i+1)
• Learner observes reward r(i+1)
• ...
The learner is supposed to find a policy, mapping s to a, that maximizes expected reward
over time.

[Figure: the learner receives a state and a reward from the environment and sends an action back to it.]
This problem setting is equivalent to online supervised learning under the following
assumptions:
1. The space of possible outputs is binary (e.g. {+1, −1}) and the space of possible re-
wards is binary (e.g. {+1, −1});
2. s(i) is independent of all previous s(j) and a(j) ; and
3. r(i) depends only on s(i) and a(i) .
In this case, for any experience tuple (s(i) , a(i) , r(i) ), we can generate a supervised training
example, which is equal to (s(i) , a(i) ) if r(i) = +1 and (s(i) , −a(i) ) otherwise.
Study Question: What supervised-learning loss function would this objective corre-
spond to?


Reinforcement learning is more interesting when these properties do not hold. When
we relax assumption 1 above, we have the class of bandit problems, which we will discuss
in section 1. If we relax assumption 2, but assume that the environment that the agent
is interacting with is an MDP, so that s(i) depends only on s(i−1) and a(i−1) then we are
in the classical reinforcement-learning setting, which we discuss in section 2. Weakening
the assumptions further, for instance, not allowing the learner to observe the current state
completely and correctly, makes the problem substantially more difficult, and beyond the
scope of this class.

1 Bandit problems
A basic bandit problem is given by

• A set of actions A;

• A set of reward values R; and

• A probabilistic reward function R : A → R where R(a) = P(R | A = a) is a probability


distribution over possible reward values in R conditioned on which action is selected.

The most typical bandit problem has R = {0, 1} and |A| = k. This is called a k-armed bandit problem. (Why? Because in English slang, "one-armed bandit" is a name for a slot machine, an old-style gambling machine where you put a coin into a slot and then pull its arm to see if you get a payoff, because it has one arm and takes your money! What we have here is a similar sort of machine, but with k arms.) There is a lot of mathematical literature on optimal strategies for k-armed bandit problems under various assumptions. The important question is usually one of exploration versus exploitation. Imagine that you have tried each action 10 times, and now you have an estimate p̂_j for the expected value of R(a_j). Which arm should you pick next? You could

• exploit your knowledge, and choose the arm with the highest value of p̂_j on all future trials; or

• explore further, by trying some or all actions more times, hoping to get better estimates of the p_j values.
The theory ultimately tells us that the longer our horizon H (or, similarly, the closer our discount factor is to 1), the more time we should spend exploring, so that we don't converge prematurely on a bad choice of action.
Study Question: Why is it that “bad” luck during exploration is more dangerous
than “good” luck? Imagine that there is an action that generates reward value 1 with
probability 0.9, but the first three times you try it, it generates value 0. How might
that cause difficulty? Why is this more dangerous than the situation when an action
that generates reward value 1 with probability 0.1 actually generates reward 1 on the
first three tries?
Note that what makes this a very different kind of problem from the batch supervised learning setting is that:

• The agent gets to influence what data it gets (selecting a_j gives it another sample from r_j), and

• The agent is penalized for mistakes it makes while it is learning (if it is trying to maximize the expected sum of r_t it gets while behaving).

(There is a setting of supervised learning, called active learning, where instead of being given a training set, the learner gets to select values of x and the environment gives back a label y; the problem of picking good x values to query is interesting, but the problem of deriving a hypothesis from (x, y) pairs is the same as the supervised problem we have been studying.)

In a contextual bandit problem, you have multiple possible states, drawn from some set S, and a separate bandit problem associated with each one.
Bandit problems will be an essential sub-component of reinforcement learning.

2 Sequential problems
In the more typical (and difficult!) case, we can think of our learning agent interacting with
an MDP, where it knows S and A, but not T (s, a, s 0 ) or R(s, a). The learner can interact
with the environment by selecting actions. So, this is somewhat like a contextual bandit
problem, but more complicated, because selecting an action influences not only what the
immediate reward will be, but also what state the system ends up in at the next time step
and, therefore, what additional rewards might be available in the future.
A reinforcement-learning (RL) algorithm is a kind of policy that depends on the whole
history of states, actions, and rewards and selects the next action to take. There are several
different ways to measure the quality of an RL algorithm, including:
• Ignoring the rt values that it gets while learning, but consider how many interactions
with the environment are required for it to learn a policy π : S → A that is nearly
optimal.

• Maximizing the expected discounted sum of total rewards while it is learning.


Most of the focus is on the first criterion, because the second one is very difficult. The first
criterion is reasonable when the learning can take place somewhere safe (imagine a robot
learning, inside the robot factory, where it can’t hurt itself too badly) or in a simulated
environment.
Approaches to reinforcement-learning differ significantly according to what kind of
hypothesis or model they learn. In the following sections, we will consider several different
approaches.

2.1 Model-based RL
The conceptually simplest approach to RL is to estimate R and T from the data we have got-
ten so far, and then use those estimates, together with an algorithm for solving MDPs (such
as value iteration) to find a policy that is near-optimal given the current model estimates.
Assume that we have had some set of interactions with the environment, which can be
characterized as a set of tuples of the form (s^{(t)}, a^{(t)}, r^{(t)}, s^{(t+1)}).
We can estimate T (s, a, s 0 ) using a simple counting strategy,

    T̂(s, a, s′) = (#(s, a, s′) + 1) / (#(s, a) + |S|) .

Here, #(s, a, s′) represents the number of times in our data set we have the situation where s_t = s, a_t = a, s_{t+1} = s′, and #(s, a) represents the number of times in our data set we have the situation where s_t = s, a_t = a.
Study Question: Prove to yourself that #(s, a) = Σ_{s′} #(s, a, s′).
Adding 1 and |S| to the numerator and denominator, respectively, are a form of smooth-
ing called the Laplace correction. It ensures that we never estimate that a probability is 0,
and keeps us from dividing by 0. As the amount of data we gather increases, the influence
of this correction fades away.
We also estimate the reward function R(s, a):

    R̂(s, a) = (Σ r | s, a) / #(s, a) ,

where

    Σ r | s, a = Σ_{t : s_t = s, a_t = a} r^{(t)} .


This is just the average of the observed rewards for each s, a pair.
We can now solve the MDP (S, A, T̂ , R̂) to find an optimal policy using value iteration,
or use a finite-depth expecti-max search to find an action to take for a particular state.
This technique is effective for problems with small state and action spaces, where it is not too hard to get enough experience to estimate T and R well; but it is difficult to generalize to continuous (or very large discrete) state spaces, and that is a topic of current research.
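A sketch of the counting estimates with the Laplace correction (our own code), assuming states and actions are represented as integers 0..|S|−1 and 0..|A|−1:

    import numpy as np

    def estimate_model(experience, n_states, n_actions):
        # experience is a list of (s, a, r, s_next) tuples with integer s, a, s_next.
        counts = np.zeros((n_states, n_actions, n_states))
        reward_sums = np.zeros((n_states, n_actions))
        for s, a, r, s_next in experience:
            counts[s, a, s_next] += 1
            reward_sums[s, a] += r
        sa_counts = counts.sum(axis=2)
        T_hat = (counts + 1) / (sa_counts[:, :, None] + n_states)   # Laplace correction
        # Average observed reward (guarding against unvisited (s, a) pairs).
        R_hat = reward_sums / np.maximum(sa_counts, 1)
        return T_hat, R_hat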

2.2 Policy search


A very different strategy is to search directly for a good policy, without estimating the
model. The strategy here is to define a functional form f(s; θ) = a for the policy, where θ
represents the parameters we learn from experience. We choose f to be differentiable, and
often let f(s; θ) = P(a), a probability distribution over our possible actions.
Now, we can train the policy parameters using gradient descent:

• When θ has relatively low dimension, we can compute a numeric estimate of the gradient by running the policy multiple times for θ ± ε, and computing the resulting rewards.

• When θ has higher dimensions (e.g., it is a complicated neural network), there are
more clever algorithms, e.g., one called REINFORCE, but they can often be difficult to
get to work reliably.

Policy search is a good choice when the policy has a simple known form, but the model
would be much more complicated to estimate.

2.3 Value function learning


The most popular class of algorithms learns neither explicit transition and reward models
nor a direct policy, but instead concentrates on learning a value function. It is a topic of
current research to describe exactly under what circumstances value-function-based ap-
proaches are best, and there are a growing number of methods that combine value func-
tions, transition and reward models and policies into a complex learning algorithm in an
attempt to combine the strengths of each approach.
We will study two variations on value-function learning, both of which estimate the Q
function.

2.3.1 Q-learning
This is the most typical way of performing reinforcement learning. Recall the value-iteration update:

    Q(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) max_{a′} Q(s′, a′)

We will adapt this update to the RL scenario, where we do not know the transition function T or the reward function R. (The thing that most students seem to get confused about is when we do value iteration and when we do Q-learning. Value iteration assumes you know T and R and just need to compute Q. In Q-learning, we don't know or even directly estimate T and R: we estimate Q directly from experience!)


Q-LEARNING(S, A, s_0, γ, α)
1   for s ∈ S, a ∈ A:
2       Q[s, a] = 0
3   s = s_0    // Or draw an s randomly from S
4   while True:
5       a = select_action(s, Q)
6       r, s′ = execute(a)
7       Q[s, a] = (1 − α)Q[s, a] + α(r + γ max_{a′} Q[s′, a′])
8       s = s′

Here, α represents the “learning rate,” which needs to decay for convergence purposes,
but in practice is often set to a constant.
Note that the update can be rewritten as

    Q[s, a] = Q[s, a] − α ( Q[s, a] − (r + γ max_{a′} Q[s′, a′]) ) ,

which looks something like a gradient update! (It is actually not a gradient update, but later, when we consider function approximation, we will treat it as if it were.) This is often called a temporal difference learning method, because we make an update based on the difference between the current estimated value of taking action a in state s, which is Q[s, a], and the "one-step" sampled value of taking a in s, which is r + γ max_{a′} Q[s′, a′].
You can see this method as a combination of two different iterative processes that we
have already seen: the combination of an old estimate with a new sample using a running
average with a learning rate α, and the dynamic-programming update of a Q value from
value iteration.
Our algorithm above includes a procedure called select_action, which, given the current
state s, has to decide which action to take. If the Q value is estimated very accurately and
the agent is behaving in the world, then generally we would want to choose the apparently
optimal action arg maxa∈A Q(s, a). But, during learning, the Q value estimates won’t be
very good and exploration is important. However, exploring completely at random is also
usually not the best strategy while learning, because it is good to focus your attention on
the parts of the state space that are likely to be visited when executing a good policy (not a
stupid one).
A typical action-selection strategy is the ε-greedy strategy:

• with probability 1 − ε, choose arg max_{a∈A} Q(s, a);

• with probability ε, choose an action a ∈ A uniformly at random.
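A sketch of select_action implementing the ε-greedy strategy (our own code, assuming Q is stored as a dictionary keyed by (s, a) pairs and actions is the list A):

    import random

    def select_action(s, Q, actions, epsilon=0.1):
        # With probability epsilon explore uniformly at random;
        # otherwise exploit the current Q estimates.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])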

Q-learning has the surprising property that it is guaranteed to converge to the actual
optimal Q function under fairly weak conditions! Any exploration strategy is okay as
long as it tries every action infinitely often on an infinite run (so that it doesn’t converge
prematurely to a bad action choice).
Q-learning can be very sample-inefficient: imagine a robot that has a choice between
moving to the left and getting a reward of 1, then returning to its initial state, or moving
to the right and walking down a 10-step hallway in order to get a reward of 1000, then
returning to its initial state.

[Figure: the robot starts at position 0 of a 10-step hallway (positions 1 through 10 to its right); moving left yields reward +1, and moving right at the end of the hallway yields reward +1000.]


The first time the robot moves to the right and goes down the hallway, it will update the
Q value for the last state on the hallway to have a high value, but it won’t yet understand
that moving to the right was a good choice. The next time it moves down the hallway
it updates the value of the state before the last one, and so on. After 10 trips down the
hallway, it now can see that it is better to move to the right than to the left.
More concretely, consider the vector of Q values Q(0 : 10, right), representing the Q
values for moving right at each of the positions 0, . . . , 9. Then, for α = 1 and γ = 0.9,

    Q(i, right) = R(i, right) + 0.9 · max_a Q(i + 1, a)

Starting with Q values of 0,

    Q^(0)(0 : 10, right) = [0  0  0  0  0  0  0  0  0  0  0]

(We are violating our usual notational conventions here, and writing Q^(i) to mean the Q value function that results after the robot runs all the way to the end of the hallway i times, when executing the policy that always moves to the right.)
Since the only nonzero reward from moving right is R(9, right) = 1000, after our robot makes it down the hallway once, our new Q vector is

    Q^(1)(0 : 10, right) = [0  0  0  0  0  0  0  0  0  1000  0]

After making its way down the hallway again, Q(8, right) = 0 + 0.9 · Q(9, right) = 900 updates:

    Q^(2)(0 : 10, right) = [0  0  0  0  0  0  0  0  900  1000  0]

Similarly,

    Q^(3)(0 : 10, right) = [0  0  0  0  0  0  0  810  900  1000  0]
    Q^(4)(0 : 10, right) = [0  0  0  0  0  0  729  810  900  1000  0]
    ...
    Q^(10)(0 : 10, right) = [387.4  430.5  478.3  531.4  590.5  656.1  729  810  900  1000  0] ,

and the robot finally sees the value of moving right from position 0. (We can see how this interacts with the exploration/exploitation dilemma: from the perspective of s_0, it will seem, for a long time, that getting the immediate reward of 1 is a better idea, and it would be easy to converge on that as a strategy without exploring the long hallway sufficiently.)

Study Question: Determine the Q value functions that will result from updates due to the robot always executing the "move left" policy.

2.3.2 Function approximation

In our Q-learning algorithm above, we essentially keep track of each Q value in a table, indexed by s and a. What do we do if S and/or A are large (or continuous)?
We can use a function approximator like a neural network to store Q values. For example, we could design a neural network that takes in inputs s and a, and outputs Q(s, a). We can treat this as a regression problem, optimizing the squared Bellman error, with loss:
    ( Q(s, a) − (r + γ max_{a′} Q(s′, a′)) )² ,

where Q(s, a) is now the output of the neural network.


There are actually several different architectural choices for using a neural network to
approximate Q values:

• One network for each action aj , that takes s as input and produces Q(s, aj ) as output;

• One single network that takes s as input and produces a vector Q(s, ·), consisting of
the Q values for each action; or


• One single network that takes s, a concatenated into a vector (if a is discrete, we
would probably use a one-hot encoding, unless it had some useful internal structure)
and produces Q(s, a) as output.

The first two choices are only suitable for discrete (and not too big) action sets. The last choice can be applied for continuous actions, but then it is difficult to find arg max_{a∈A} Q(s, a). (For continuous action spaces, it is increasingly popular to use a class of methods called actor-critic methods, which combine policy and value-function learning. We won't get into them in detail here, though.)
There are not many theoretical guarantees about Q-learning with function approximation and, indeed, it can sometimes be fairly unstable (learning to perform well for a while, and then getting suddenly worse, for example). But it has also had some significant successes.
One form of instability that we do know how to guard against is catastrophic forgetting.
In standard supervised learning, we expect that the training x values were drawn inde-
pendently from some distribution. But when a learning agent, such as a robot, is moving And, in fact, we rou-
through an environment, the sequence of states it encounters will be temporally correlated. tinely shuffle their order
in the data file, anyway.
This can mean that while it is in the dark, the neural-network weight-updates will make
the Q function “forget” the value function for when it’s light. We can handle this with For example, it might
experience replay, where we save our (s, a, r, s 0 ) experiences in a buffer and mix into our spend 12 hours in a
learning process some updates from the historic buffer. dark environment and
then 12 in a light one.

2.3.3 Fitted Q-learning


An alternative strategy for learning the Q function that is somewhat more robust than the
standard Q-learning algorithm is a method called fitted Q.

Fitted-Q-Learning(A, s_0, γ, α, ε, m)

1  s = s_0    // Or draw an s randomly from S
2  D = { }
3  initialize neural-network representation of Q
4  while True:
5      D_new = experience from executing the ε-greedy policy based on Q for m steps
6      D = D ∪ D_new, represented as (s, a, r, s′) tuples
7      D_sup = {(x^(i), y^(i))} where x^(i) = (s, a) and y^(i) = r + γ max_{a′∈A} Q(s′, a′), for each tuple (s, a, r, s′)^(i) ∈ D
8      re-initialize neural-network representation of Q
9      Q = supervised_NN_regression(D_sup)

Here, we alternate between using the policy induced by the current Q function to gather
a batch of data Dnew , adding it to our overall data set D, and then using supervised neural-
network training to learn a representation of the Q value function on the whole data set.
This method does not mix the dynamic-programming phase (computing new Q values
based on old ones) with the function approximation phase (training the neural network)
and avoids catastrophic forgetting. The regression training in line 9 typically uses squared
error as a loss function and would be trained until the fit is good (possibly measured on
held-out data).
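Here is a rough sketch of how that loop might look in code, under assumptions that go beyond the pseudocode: actions are plain numbers, `env.reset()` and `env.step(s, a)` returning (r, s′) are hypothetical stand-ins for the environment, and scikit-learn's MLPRegressor stands in for supervised_NN_regression.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # any regressor with fit/predict works

def fitted_q(env, actions, gamma=0.9, eps=0.1, m=200, iterations=20):
    D = []                                   # all (s, a, r, s') tuples seen so far
    Q = None                                 # no model yet: act randomly at first

    def q_values(s):
        if Q is None:
            return np.zeros(len(actions))
        X = np.array([np.concatenate([s, [a]]) for a in actions])
        return Q.predict(X)

    s = env.reset()
    for _ in range(iterations):
        for _ in range(m):                   # gather m steps of eps-greedy experience
            if np.random.rand() < eps or Q is None:
                a_idx = np.random.randint(len(actions))
            else:
                a_idx = int(np.argmax(q_values(s)))
            r, s_next = env.step(s, actions[a_idx])
            D.append((s, actions[a_idx], r, s_next))
            s = s_next
        # Build a supervised data set from the whole buffer, using the *old* Q for
        # targets, then re-fit a fresh network from scratch.
        X = np.array([np.concatenate([si, [ai]]) for (si, ai, ri, sni) in D])
        y = np.array([ri + gamma * np.max(q_values(sni)) for (si, ai, ri, sni) in D])
        Q = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500).fit(X, y)
    return Q
```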



CHAPTER 12

Recurrent Neural Networks

In chapter 8 we studied neural networks and how we can train the weights of a network,
based on data, so that it will adapt into a function that approximates the relationship be-
tween the (x, y) pairs in a supervised-learning training set. In section 1 of chapter 10, we
studied state-machine models and defined recurrent neural networks (RNNs) as a particular
type of state machine, with a multidimensional vector of real values as the state. In this
chapter, we’ll see how to use gradient-descent methods to train the weights of an RNN so
that it performs a transduction that matches as closely as possible a training set of input-
output sequences.

1 RNN model
Recall that the basic operation of the state machine is to start with some state s0 , then
iteratively compute:

s_t = f(s_{t−1}, x_t)
y_t = g(s_t)

as illustrated in the diagram below (remembering that there needs to be a delay on the
feedback loop):

[Diagram: x_t and the delayed state s_{t−1} feed into f, producing s_t; g maps s_t to the output y_t.]

So, given a sequence of inputs x1 , x2 , . . . the machine generates a sequence of outputs

y_1 = g(f(x_1, s_0)),  y_2 = g(f(x_2, f(x_1, s_0))),  . . . .


A recurrent neural network is a state machine with neural networks constituting functions
f and g:

f(s, x) = f_1(W^{sx} x + W^{ss} s + W^{ss}_0)
g(s) = f_2(W^O s + W^O_0) .

The inputs, outputs, and states are all vector-valued:

x_t : ℓ × 1
s_t : m × 1
y_t : v × 1 .

The weights in the network, then, are

W^{sx} : m × ℓ
W^{ss} : m × m
W^{ss}_0 : m × 1
W^O : v × m
W^O_0 : v × 1

with activation functions f_1 and f_2. (We are very sorry! This course material has evolved from different sources, which used W^T x in the forward pass for regular feed-forward NNs and Wx for the forward pass in RNNs. I only just noticed this inconsistency and it's too late to change for this term. Of course it doesn't make any technical difference, but it is a potential source of confusion.) Finally, the operation of the RNN is described by

s_t = f_1(W^{sx} x_t + W^{ss} s_{t−1} + W_0)
y_t = f_2(W^O s_t + W^O_0) .


Study Question: Check dimensions here to be sure it all works out. Remember that
we apply f1 and f2 elementwise.
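A small numerical sketch (ours, not from the notes) can help with the dimension check; both f_1 and f_2 are tanh here purely for illustration, and the weights are random.

```python
import numpy as np

ell, m, v = 3, 4, 2                                # input, state, output dimensions
rng = np.random.default_rng(0)
Wsx, Wss, W0 = rng.normal(size=(m, ell)), rng.normal(size=(m, m)), rng.normal(size=(m, 1))
WO, WO0 = rng.normal(size=(v, m)), rng.normal(size=(v, 1))

def rnn_forward(xs, s0):
    """xs: list of (ell x 1) inputs; returns the list of (v x 1) outputs."""
    s, ys = s0, []
    for x in xs:
        s = np.tanh(Wsx @ x + Wss @ s + W0)        # s_t = f1(W^sx x_t + W^ss s_{t-1} + W_0)
        ys.append(np.tanh(WO @ s + WO0))           # y_t = f2(W^O s_t + W^O_0)
    return ys

outputs = rnn_forward([rng.normal(size=(ell, 1)) for _ in range(5)], np.zeros((m, 1)))
print([y.shape for y in outputs])                  # each output is (v, 1)
```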

2 Sequence-to-sequence RNN
Now, how can we train an RNN to model a transduction on sequences? This problem is sometimes called sequence-to-sequence mapping. You can think of it as a kind of regression problem: given an input sequence, learn to generate the corresponding output sequence. (One way to think of training a sequence classifier is to reduce it to a transduction problem, where y_t = 1 if the sequence x_1, . . . , x_t is a positive example of the class of sequences and −1 otherwise.)
A training set has the form ((x^(1), y^(1)), . . . , (x^(q), y^(q))), where

• x^(i) and y^(i) are length n^(i) sequences;

• sequences in the same pair are the same length; and sequences in different pairs may have different lengths.
Next, we need a loss function. We start by defining a loss function on sequences. There are many possible choices, but usually it makes sense just to sum up a per-element loss function on each of the output values:

Loss_seq( p^(i), y^(i) ) = Σ_{j=1}^{n^(i)} Loss_elt( p^(i)_j , y^(i)_j ) .

The per-element loss function Loss_elt will depend on the type of y_t and what information it is encoding, in the same way as for a supervised network (so it could be NLL, hinge loss, squared loss, etc.). Then, letting θ = (W^{sx}, W^{ss}, W^O, W_0, W^O_0), our overall objective is to minimize

J(θ) = Σ_{i=1}^{q} Loss_seq( RNN(x^(i); θ), y^(i) ) ,

where RNN(x; θ) is the output sequence generated, given input sequence x.


It is typical to choose f_1 to be tanh (remember that it looks like a sigmoid but ranges from −1 to +1), but any non-linear activation function is usable. We choose f_2 to align with the types of our outputs and the loss function, just as we would do in regular supervised learning.

3 Back-propagation through time


Now the fun begins! We can find θ to minimize J using gradient descent. We will work
through the simplest method, back-propagation through time (BPTT), in detail. This is gener-
ally not the best method to use, but it’s relatively easy to understand. In section 5 we will
sketch alternative methods that are in much more common use.

Calculus reminder Most of us are not very careful about the difference between the
partial derivative and the total derivative. We are going to use a nice example from the
Wikipedia article on partial derivatives to illustrate the difference.
The volume of a cone depends on its height and radius:

V(r, h) = πr²h / 3 .
The partial derivatives of volume with respect to height and radius are

∂V/∂r = 2πrh/3   and   ∂V/∂h = πr²/3 .
They measure the change in V assuming everything is held constant except the sin-
gle variable we are changing. But! in a cone, the radius and height are not indepen-
dent, and so we can’t really change one without changing the other. In this case, we
really have to think about the total derivative, which sums the “paths” along which
r might influence V:

dV/dr = ∂V/∂r + (∂V/∂h)(dh/dr) = 2πrh/3 + (πr²/3)(dh/dr)

dV/dh = ∂V/∂h + (∂V/∂r)(dr/dh) = πr²/3 + (2πrh/3)(dr/dh)
Just to be completely concrete, let's think of a right circular cone with a fixed apex angle α, so that tan α = r/h stays constant if we change r or h. Letting c = tan α, we have r = ch. Now, we know that

dV/dr = 2πrh/3 + (πr²/3)(1/c)
dV/dh = πr²/3 + (2πrh/3) · c


The BPTT process goes like this:

(1) Sample a training pair of sequences (x, y); let their length be n.

(2) "Unroll" the RNN to be length n, and initialize s_0:

[Figure: the RNN unrolled into n copies (n = 3 in the original picture), each copy sharing the same weights.]

Now, we can see our problem as one of performing what is almost an ordinary back-
propagation training procedure in a feed-forward neural network, but with the dif-
ference that the weight matrices are shared among the layers. In many ways, this is
similar to what ends up happening in a convolutional network, except in the conv-
net, the weights are re-used spatially, and here, they are re-used temporally.

(3) Do the forward pass, to compute the predicted output sequence p:

z^1_t = W^{sx} x_t + W^{ss} s_{t−1} + W_0
s_t = f_1(z^1_t)
z^2_t = W^O s_t + W^O_0
p_t = f_2(z^2_t)

(4) Do the backward pass to compute the gradients. For both W^{ss} and W^{sx} we need to find

dL_seq/dW = Σ_{u=1}^{n} dL_u/dW .

Letting L_u = Loss_elt(p_u, y_u) and using the total derivative, which is a sum over all the ways in which W affects L_u, we have

dL_seq/dW = Σ_{u=1}^{n} Σ_{t=1}^{n} (∂L_u/∂s_t) · (∂s_t/∂W) .

Re-organizing, we have

dL_seq/dW = Σ_{t=1}^{n} (∂s_t/∂W) · Σ_{u=1}^{n} (∂L_u/∂s_t) .


Because s_t only affects L_t, L_{t+1}, . . . , L_n,

dL_seq/dW = Σ_{t=1}^{n} (∂s_t/∂W) · Σ_{u=t}^{n} (∂L_u/∂s_t)
          = Σ_{t=1}^{n} (∂s_t/∂W) · ( ∂L_t/∂s_t + Σ_{u=t+1}^{n} ∂L_u/∂s_t ) .        (12.1)

We define δ^{s_t} = Σ_{u=t+1}^{n} ∂L_u/∂s_t, the dependence of the loss on steps after t on the state at time t (that is, δ^{s_t} is how much we can blame state s_t for all the future losses).
We can compute this backwards, with t going from n down to 1. The trickiest part is figuring out how early states contribute to later losses. We define the future loss

F_t = Σ_{u=t+1}^{n} Loss_elt(p_u, y_u) ,

so

δ^{s_t} = ∂F_t/∂s_t .

At the last stage, F_n = 0 so δ^{s_n} = 0.
Now, working backwards,

δ^{s_{t−1}} = ∂/∂s_{t−1} Σ_{u=t}^{n} Loss_elt(p_u, y_u)
           = (∂s_t/∂s_{t−1}) · ∂/∂s_t Σ_{u=t}^{n} Loss_elt(p_u, y_u)
           = (∂s_t/∂s_{t−1}) · ∂/∂s_t [ Loss_elt(p_t, y_t) + Σ_{u=t+1}^{n} Loss_elt(p_u, y_u) ]
           = (∂s_t/∂s_{t−1}) · ( ∂Loss_elt(p_t, y_t)/∂s_t + δ^{s_t} )

Now, we can use the chain rule again to find the dependence of the element loss at time t on the state at that same time,

∂Loss_elt(p_t, y_t)/∂s_t = (∂z^2_t/∂s_t) · (∂Loss_elt(p_t, y_t)/∂z^2_t) ,

where the left-hand side is m × 1, ∂z^2_t/∂s_t is m × v, and ∂Loss_elt(p_t, y_t)/∂z^2_t is v × 1. We also need the dependence of the state at time t on the state at the previous time, noting that we are performing an elementwise multiplication (∗, not a dot product!) between W^{ss T} and the m × 1 vector of f_1′ values, ∂s_t/∂z^1_t:

∂s_t/∂s_{t−1} = (∂z^1_t/∂s_{t−1}) · (∂s_t/∂z^1_t) = W^{ss T} ∗ f_1′(z^1_t) .

(There are two ways to think about ∂s_t/∂z^1_t: here, we take the view that it is an m × 1 vector and we multiply each column of W^T by it. Another, equally good, view is that it is an m × m diagonal matrix with the values along the diagonal, and then this operation is a matrix multiply. Our software implementation will take the first view.)
Putting this all together, we end up with

δ^{s_{t−1}} = ( W^{ss T} ∗ f_1′(z^1_t) ) · ( W^{O T} (∂L_t/∂z^2_t) + δ^{s_t} ) ,

where the first factor is ∂s_t/∂s_{t−1} and the second factor is ∂F_{t−1}/∂s_t.

We're almost there! Now, we can describe the actual weight updates. Using equation 12.1 and recalling the definition of δ^{s_t} = ∂F_t/∂s_t, as we iterate backwards, we can accumulate the terms in equation 12.1 to get the gradient for the whole loss:

dL_seq/dW^{ss} += ∂F_{t−1}/∂W^{ss} = (∂z^1_t/∂W^{ss}) (∂s_t/∂z^1_t) (∂F_{t−1}/∂s_t)
dL_seq/dW^{sx} += ∂F_{t−1}/∂W^{sx} = (∂z^1_t/∂W^{sx}) (∂s_t/∂z^1_t) (∂F_{t−1}/∂s_t)

We can handle W^O separately; it's easier because it does not affect future losses in the way that the other weight matrices do:

∂L_seq/∂W^O = Σ_{t=1}^{n} ∂L_t/∂W^O = Σ_{t=1}^{n} (∂L_t/∂z^2_t) · (∂z^2_t/∂W^O)

Assuming we have ∂L_t/∂z^2_t = (p_t − y_t) (which ends up being true for squared loss, softmax-NLL, etc.), then on each iteration

∂L_seq/∂W^O += (p_t − y_t) · s_t^T ,

where (p_t − y_t) is v × 1, s_t^T is 1 × m, and the result is v × m.

Whew!

Study Question: Derive the updates for the offsets W0 and W0O .
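The whole procedure can be condensed into a short sketch. The code below is ours, not the course's, and it assumes f_1 = tanh, f_2 = identity, and squared loss (so ∂L_t/∂z^2_t = p_t − y_t); it follows the accumulation scheme above, including the offsets from the study question.

```python
import numpy as np

def bptt(Wsx, Wss, W0, WO, WO0, xs, ys):
    """xs: list of (l x 1) inputs; ys: list of (v x 1) targets. Returns gradients."""
    n, m = len(xs), Wss.shape[0]
    s = [np.zeros((m, 1))]                       # s[0] = s_0
    ps = [None]                                  # ps[t] = p_t
    for t in range(1, n + 1):                    # forward pass
        s.append(np.tanh(Wsx @ xs[t - 1] + Wss @ s[t - 1] + W0))
        ps.append(WO @ s[t] + WO0)               # f2 = identity here
    grads = {name: np.zeros_like(W) for name, W in
             [('Wsx', Wsx), ('Wss', Wss), ('W0', W0), ('WO', WO), ('WO0', WO0)]}
    delta = np.zeros((m, 1))                     # delta^{s_n} = 0
    for t in range(n, 0, -1):                    # backward pass
        dLdz2 = ps[t] - ys[t - 1]                # dL_t / dz^2_t for squared loss
        grads['WO'] += dLdz2 @ s[t].T
        grads['WO0'] += dLdz2
        dFds = WO.T @ dLdz2 + delta              # dF_{t-1} / ds_t
        a = (1 - s[t] ** 2) * dFds               # f_1'(z^1_t) * dFds, elementwise (tanh)
        grads['Wss'] += a @ s[t - 1].T
        grads['Wsx'] += a @ xs[t - 1].T
        grads['W0'] += a
        delta = Wss.T @ a                        # delta^{s_{t-1}}
    return grads
```

A numerical gradient check against finite differences is a good way to validate an implementation like this.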

4 Training a language model


A language model is just trained on a set of input sequences, (c^(i)_1, c^(i)_2, . . . , c^(i)_{n_i}), and is used to predict the next character (a "token" is generally a character or a word), given a sequence of previous tokens:

c_t = RNN(c_1, c_2, . . . , c_{t−1})

We can convert this to a sequence-to-sequence training problem by constructing a data set of (x, y) sequence pairs, where we make up new special tokens, start and end, to signal the beginning and end of the sequence:

x = (⟨start⟩, c_1, c_2, . . . , c_n)
y = (c_1, c_2, . . . , ⟨end⟩)
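For instance, building the (x, y) pairs might look like the following sketch, where the "<start>" and "<end>" strings are just made-up marker tokens:

```python
START, END = "<start>", "<end>"

def make_pairs(sequences):
    pairs = []
    for c in sequences:
        x = [START] + list(c)          # x = (<start>, c_1, ..., c_n)
        y = list(c) + [END]            # y = (c_1, ..., c_n, <end>)
        pairs.append((x, y))
    return pairs

print(make_pairs(["hello"]))
# [(['<start>', 'h', 'e', 'l', 'l', 'o'], ['h', 'e', 'l', 'l', 'o', '<end>'])]
```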

5 Vanishing gradients and gating mechanisms


Let’s take a careful look at the backward propagation of the gradient along the sequence:
 
δ^{s_{t−1}} = (∂s_t/∂s_{t−1}) · ( ∂Loss_elt(p_t, y_t)/∂s_t + δ^{s_t} )
Consider a case where only the output at the end of the sequence is incorrect, but it depends
critically, via the weights, on the input at time 1. In this case, we will multiply the loss at
step n by
(∂s_2/∂s_1) · (∂s_3/∂s_2) · · · (∂s_n/∂s_{n−1}) .


In general, this quantity will either grow or shrink exponentially with the length of the
sequence, and make it very difficult to train.
Study Question: The last time we talked about exploding and vanishing gradients, it
was to justify per-weight adaptive step sizes. Why is that not a solution to the prob-
lem this time?
An important insight that really made recurrent networks work well on long sequences
was the idea of gating.

5.1 Simple gated recurrent networks


A computer only ever updates some parts of its memory on each computation cycle. We
can take this idea and use it to make our networks more able to retain state values over time
and to make the gradients better-behaved. We will add a new component to our network,
called a gating network. Let g_t be an m × 1 vector of values and let W^{gx} and W^{gs} be m × ℓ and m × m weight matrices, respectively. We will compute g_t as

g_t = sigmoid(W^{gx} x_t + W^{gs} s_{t−1})

(it can have an offset, too, but we are omitting it for simplicity) and then change the computation of s_t to be

s_t = (1 − g_t) ∗ s_{t−1} + g_t ∗ f_1(W^{sx} x_t + W^{ss} s_{t−1} + W_0) ,

where ∗ is component-wise multiplication. We can see, here, that the output of the gating
network is deciding, for each dimension of the state, how much it should be updated now.
This mechanism makes it much easier for the network to learn to, for example, “store”
some information in some dimension of the state, and then not change it during future
state updates.
Study Question: Why is it important that the activation function for g be a sigmoid?
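A single gated update, written out as a sketch (our own toy function, no offset on the gate, as in the text; ∗ below is elementwise multiplication):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gated_step(s_prev, x, Wgx, Wgs, Wsx, Wss, W0):
    g = sigmoid(Wgx @ x + Wgs @ s_prev)                    # m x 1 gate values in (0, 1)
    candidate = np.tanh(Wsx @ x + Wss @ s_prev + W0)       # proposed new state
    return (1 - g) * s_prev + g * candidate                # per-dimension blend
```

Dimensions of the state with gate values near 0 keep their old values almost unchanged, which is exactly the "store and don't touch" behavior described above.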

5.2 Long short-term memory


The idea of gating networks can be applied to make a state-machine that is even more like a computer memory, resulting in a type of network called an LSTM for "long short-term memory" (yet another awesome name for a neural network!). We won't go into the details here, but the basic idea is that there is a memory cell (really, our state vector) and three (!) gating networks. The input gate selects (using a "soft" selection as in the gated network above) which dimensions of the state will be updated with new values; the forget gate decides which dimensions of the state will have their old values moved toward 0; and the output gate decides which dimensions of the state will be used to compute the output value. These networks have been used in applications like language translation with really amazing results. A diagram of the architecture is shown below:

[Figure: LSTM architecture, with the memory cell and the input, forget, and output gates.]



CHAPTER 13

Recommender systems

The problem of choosing items from a large set to recommend to a user comes up in many
contexts, including music services, shopping, and online advertisements. As well as be-
ing an important application, it is interesting because it has several formulations, some of
which take advantage of a particularly interesting structure in the problem.
Concretely, we can think about a company like Netflix, which recommends movies to its
users. Netflix knows the ratings given by many different people to many different movies,
and your ratings on a small subset of all possible movies. How should it use this data to
recommend a movie for you to watch tonight?
There are two prevailing approaches to this problem. The first, content-based recom-
mendation, is formulated as a supervised learning problem. The second, collaborative
filtering, introduces a new learning problem formulation.

1 Content-based recommendations
In content-based recommendation, we try to learn a predictor, f, that uses the movies that
you have rated so far as training data, finds a hypothesis that maps a movie into a predic-
tion of what rating you would give it, and then returns some movies with high predicted
ratings.
The first step is designing representations for the input and output.
It’s actually pretty difficult to design a good feature representation for movies. Reason-
able approaches might construct features based on the movie’s genre, length, main actors,
director, location, or even ratings given by some standard critics or aggregation sources.
This design process would yield

φ : movie → vector .

Movie ratings are generally given in terms of some number of stars, so the output domain might be {1, 2, 3, 4, 5}. It's not really appropriate to use a one-hot encoding on the output, and pretending that these are real values is also not entirely sensible. Nevertheless, we will treat the output as if it's in R. (Thermometer coding might be reasonable, but it's hard to say without trying it. Some more advanced techniques try to predict rankings (would I prefer movie A over movie B?) rather than raw ratings.)

Study Question: Why is one-hot not a good choice? Why is R not a good choice?

Now that we have an encoding, we can make a training set based on your previous
ratings of movies

   
( φ(m^(1)), rating(m^(1)) ), ( φ(m^(2)), rating(m^(2)) ), . . .

The next step is to pick a loss function. This is closely related to the choice of output encoding. Since we decided to treat the output as a real, we can formulate the problem as a regression from φ → R, with Loss(p, y) = (1/2)(y − p)². We will generally need to regularize because we typically have a very small amount of data (unless you really watch a lot of movies!).
Finally, we need to pick a hypothesis space. The simplest thing would be to make it
linear, but you could definitely use something fancier, like a neural network.
If we put all this together, with a linear hypothesis space, we end up with the objective

J(θ) = (1/2) Σ_{i∈D_a} ( y^(i) − θ^T x^(i) − θ_0 )² + (λ/2) ‖θ‖² .

This is our old friend, ridge regression, and can be solved analytically or with gradient
descent.
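For concreteness, here is a small sketch of the analytic ridge solution for this objective. The function names are ours; Phi is assumed to hold the feature vectors φ(m^(i)) of your rated movies as rows, y their ratings, and the offset θ_0 is left unpenalized to match the objective above.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    n, d = Phi.shape
    Xb = np.hstack([Phi, np.ones((n, 1))])          # append a column for theta_0
    reg = lam * np.eye(d + 1)
    reg[-1, -1] = 0.0                               # don't penalize the offset
    theta = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)
    return theta[:-1], theta[-1]                    # (theta, theta_0)

def predict(Phi_new, theta, theta_0):
    return Phi_new @ theta + theta_0                # predicted ratings
```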

2 Collaborative filtering
There are two difficulties with content-based recommendation systems:

• It’s hard to design a good feature set to represent movies.

• They only use your previous movie ratings, but don’t have a way to use the rest of
their data.

In collaborative filtering, we’ll try to use all the ratings that other people have made of
movies to help make better predictions for you.
Intuitively, we can see this process as finding the kinds of people who like the kinds of movies I like, and then predicting that I will like other movies that they like. (In fact, there's a third strategy that is really directly based on this idea, in which we concretely try to find other users who are our "nearest neighbors" in movie preferences, and then predict movies they like. The approach we discuss here has similar motivations but is more robust.)
Formally, we will start by constructing a data matrix Y, where Y_ai represents the score given by user a to movie i. So, if we have n users and m movies, Y has shape n × m. (We will in fact not actually represent the whole data matrix explicitly; it would be too big. But it's useful to think about.)


[Figure: the n × m data matrix Y (n users by m movies), with only a few entries, such as 5, 3, 1, 2, filled in.]

Y is very sparse (most entries are empty). (In the Netflix challenge data set, there are 400,000 users and 17,000 movies; only 1% of the data matrix is filled.) So, we will think of our training data-set as D = {(a, i, r)}, a set of tuples, where a is the index assigned to a particular user, i is the index assigned to a particular movie, and r is user a's rating of movie i.
We are going to try to find a way to use D to predict values for missing entries. Let X be our predicted matrix of ratings. Now, we need to find a loss function that relates X and Y, so that we can try to optimize it to find a good predictive model.

Idea #1 Following along with our previous approaches to designing loss functions, we
might want to say that our predictions Xai should agree with our data Yai , and then add
some regularization, yielding loss function

Loss(X, Y) = (1/2) Σ_{(a,i)∈D} (Y_ai − X_ai)² + Σ_{all (a,i)} X_ai² .

This is a bad idea! It will set X_ai = 0 for all (a, i) ∉ D.


Study Question: Convince yourself of that!
We need to find a different kind of regularization that will force some generalization to
unseen entries.

Linear algebra idea: The rank of a matrix is the maximum number of linearly in-
dependent rows in the matrix (which is equal to the maximum number of linearly
independent columns in the matrix).
If an n × m matrix X is rank 1, then there exist U and V of shapes n × 1 and m × 1,
respectively, such that
X = UV T .
If X is rank k, then there exist U and V of shape n × k and m × k, respectively, such
that
X = UV T .


Idea #2 Find the rank 1 matrix X that fits the entries in Y as well as possible. This is a much
lower-dimensional representation (it has m + n parameters rather than m · n parameters)
and the same parameter is shared among many predictions, so it seems like it might have
better generalization properties than our previous idea.
So, we would need to find vectors U and V such that
 (1)   (1) (1)
· · · U(1) V (m)

U U V
UV T =  ...  V (1) · · · V (m) =  .. .. ..
  
=X .
 
. . .
U(n) U(n) V (1) ··· U(n) V (m)

And, since we’re using squared loss, our objective function would be

J(U, V) = (1/2) Σ_{(a,i)∈D} ( U^(a) V^(i) − Y_ai )² .

Now, how can we find the optimal values of U and V? We could take inspiration from our work on linear regression and look at the gradients of J with respect to the parameters in U and V. For example,

∂J/∂U^(a) = Σ_{i : (a,i)∈D} ( U^(a) V^(i) − Y_ai ) V^(i) .

We could get an equation like this for each parameter U(a) or V (i) . We don’t know how to
get an immediate analytic solution to this set of equations because the parameters U and
V are multiplied by one another in the predictions, so the model does not have a linear
dependence on the parameters. We could approach this problem using gradient descent,
though, and we’ll do that with a related model in the next section.
But, before we talk about optimization, let’s think about the expressiveness of this
model. It has one parameter per user (the elements of U) and one parameter per movie
(the elements of V), and the predicted rating is the product of these two. It can really repre-
sent only each user’s general enthusiasm and each movie’s general popularity, and predict
the user’s rating of the movie to be the product of these values.
Study Question: What if we had two users, 1 and 2, and two movies, A and B. Can
you find U, V that represents well the data set (1, A, 1), (1, B, 5), (2, A, 5), (2, B, 1)?

Idea #3 If using a rank 1 decomposition of the matrix is not expressive enough, maybe
we can try a rank k decomposition! In this case, we would try to find an n × k matrix U and
an m × k matrix V that minimize
J(U, V) = (1/2) Σ_{(a,i)∈D} ( U^(a) · V^(i) − Y_ai )² .


[Figure: X = U V^T, where the a-th row U^(a) of U and the i-th row V^(i) of V give the prediction X_ai = U^(a) · V^(i).]

Here, the length k vector U(a) is the ath row of U, and represents the k “features”
of person a. Likewise, the length k vector V (i) is the ith row of V, and represents the k
“features” of movie i. Performing the matrix multiplication X = UV T , we see what the
prediction for person a and movie i is Xai = U(a) · V (i) .
The total number of parameters that we have is nk + mk. But, it is a redundant representation. We have 1 extra scaling parameter when k = 1, and k² in general. So, we really effectively have nk + mk − k² "degrees of freedom."
Study Question: Imagine k = 3. If we were to take the matrix U and multiply the first column by 2, the second column by 3 and the third column by 4, to make a new matrix U′, what would we have to do to V to get a V′ so that U′V′^T = UV^T? How does this question relate to the comments above about redundancy?
It is still useful to add offsets to our predictions, so we will include an n × 1 vector bU
and an m × 1 vector bV of offset parameters, and perform regularization on the parameters
in U and V. So our final objective becomes

J(U, V) = (1/2) Σ_{(a,i)∈D} ( U^(a) · V^(i) + b_U^(a) + b_V^(i) − Y_ai )² + (λ/2) Σ_{a=1}^{n} ‖U^(a)‖² + (λ/2) Σ_{i=1}^{m} ‖V^(i)‖² .

Study Question: What would be an informal interpretation of b_U^(a)? Of b_V^(i)?

2.1 Optimization
Now that we have an objective, it’s time to optimize! There are two reasonable approaches
to finding U, V, bU , and bV that optimize this objective: alternating least squares (ALS),
which builds on our analytical solution approach for linear regression, and stochastic gra-
dient descent (SGD), which we have used in the context of neural networks and other
models.


2.1.1 Alternating least squares


One interesting thing to notice is that, if we were to fix U and bU , then finding the mini-
mizing V and bV is a linear regression problem that we already know how to solve. The
same is true if we were to fix V and bV , and seek U and bU . So, we will consider an al-
gorithm that takes alternating steps of this form: we fix U, bU , initially randomly, find the
best V, bV ; then fix those and find the best U, bU , etc.
This is a kind of optimization sometimes called "coordinate descent," because we only improve the model in one (or, in this case, a set of) coordinates of the parameter space at a time. Generally, coordinate descent has similar kinds of convergence properties as gradient descent, and it cannot guarantee that we find a global optimum. It is an appealing choice in this problem because we know how to directly move to the optimal values of one set of coordinates given that the other is fixed.
More concretely, we:

1. Initialize V and bV at random

2. For each a in 1, 2, . . . , n

• Construct a linear regression problem to find U^(a) (and b_U^(a)) to minimize

  (1/2) Σ_{i | (a,i)∈D} ( U^(a) · V^(i) + b_U^(a) + b_V^(i) − Y_ai )² + (λ/2) ‖U^(a)‖²

• Recall minimizing the least squares objective (we are ignoring the offset and regularizer in the following so you can see the basic idea):

  (Wθ − T)^T (Wθ − T) .

  In this scenario,
  – θ = U^(a) is the k × 1 parameter vector that we are trying to find,
  – T is an m_a × 1 vector of target values (for the m_a movies a has rated), and
  – W is the m_a × k matrix whose rows are the V^(i) where a has rated movie i.
  The solution to the least squares problem using ridge regression is our new U^(a) and b_U^(a).

3. For each i in 1, 2, . . . , m

  • Construct a linear regression problem to find V^(i) and b_V^(i) to minimize

    (1/2) Σ_{a | (a,i)∈D} ( U^(a) · V^(i) + b_U^(a) + b_V^(i) − Y_ai )² + (λ/2) ‖V^(i)‖²

  • Now, θ = V^(i) is a k × 1 parameter vector, T is an n_i × 1 target vector (for the n_i users that have rated movie i), and W is the n_i × k matrix whose rows are the U^(a) where i has been rated by user a.
    Again, we solve using ridge regression for a new value of V^(i) and b_V^(i).

4. Alternate between steps 2 and 3, optimizing U and V, and stop after a fixed number
of iterations or when the difference between successive parameter estimates is small.
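A compact sketch of this alternation is below, with the offsets omitted (as in the "basic idea" least-squares view above) so that each update is a plain ridge solve; the function and variable names are our own.

```python
import numpy as np

def als(D, n, m, k, lam, iters=20):
    """D is a list of (a, i, r) tuples; n users, m movies, rank k."""
    rng = np.random.default_rng(0)
    U, V = rng.normal(size=(n, k)), rng.normal(size=(m, k))
    by_user = [[(i, r) for (a, i, r) in D if a == au] for au in range(n)]
    by_movie = [[(a, r) for (a, i, r) in D if i == im] for im in range(m)]
    for _ in range(iters):
        for a in range(n):                     # fix V, solve a ridge problem per user
            if not by_user[a]:
                continue
            W = np.array([V[i] for (i, _) in by_user[a]])     # m_a x k
            T = np.array([r for (_, r) in by_user[a]])        # m_a targets
            U[a] = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ T)
        for i in range(m):                     # fix U, solve a ridge problem per movie
            if not by_movie[i]:
                continue
            W = np.array([U[a] for (a, _) in by_movie[i]])    # n_i x k
            T = np.array([r for (_, r) in by_movie[i]])       # n_i targets
            V[i] = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ T)
    return U, V
```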


2.1.2 Stochastic gradient descent


Finally, we can approach this problem using stochastic gradient descent. It’s easier to think
about if we reorganize the objective function to be

J(U, V) = (1/2) Σ_{(a,i)∈D} [ ( U^(a) · V^(i) + b_U^(a) + b_V^(i) − Y_ai )² + λ_U^(a) ‖U^(a)‖² + λ_V^(i) ‖V^(i)‖² ]

where

λ_U^(a) = λ / (# times (a, ·) ∈ D) = λ / Σ_{i | (a,i)∈D} 1
λ_V^(i) = λ / (# times (·, i) ∈ D) = λ / Σ_{a | (a,i)∈D} 1

Then,

∂J(U, V)/∂U^(a) = Σ_{i | (a,i)∈D} [ ( U^(a) · V^(i) + b_U^(a) + b_V^(i) − Y_ai ) V^(i) + λ_U^(a) U^(a) ]

∂J(U, V)/∂b_U^(a) = Σ_{i | (a,i)∈D} ( U^(a) · V^(i) + b_U^(a) + b_V^(i) − Y_ai )

We can similarly obtain gradients with respect to V^(i) and b_V^(i).
Then, to do gradient descent, we draw an example (a, i, Y_ai) from D at random, and do gradient updates on U^(a), b_U^(a), V^(i), and b_V^(i).

Study Question: Why don't we update the other parameters, such as U^(a′) for some other user a′ or V^(i′) for some other movie i′?
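One SGD step on a randomly drawn example might look like the following sketch, where lam_u and lam_v are the per-user and per-movie λ values defined above (λ divided by the relevant rating counts); U, V are n × k and m × k arrays and bU, bV are the offset vectors.

```python
def sgd_step(example, U, V, bU, bV, lam_u, lam_v, eta=0.01):
    a, i, y = example
    err = U[a] @ V[i] + bU[a] + bV[i] - y       # prediction error on this example
    gU = err * V[i] + lam_u * U[a]              # gradient of this example's term wrt U^(a)
    gV = err * U[a] + lam_v * V[i]              # ... and wrt V^(i)
    U[a] -= eta * gU
    V[i] -= eta * gV
    bU[a] -= eta * err                          # offsets are unregularized in the objective
    bV[i] -= eta * err
```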



CHAPTER 14

Non-parametric methods

1 Intro
We will continue to broaden the class of models that we can fit to our data. Neural networks
have adaptable complexity, in the sense that we can try different structural models and use
cross validation to find one that works well on our data.
We now turn to models that automatically adapt their complexity to the training data.
The name non-parametric methods is misleading: it is really a class of methods that does not
have a fixed parameterization in advance. Some non-parametric models, such as decision
trees, which we might call semi-parametric methods, can be seen as dynamically constructing
something that ends up looking like a more traditional parametric model, but where the ac-
tual training data affects exactly what the form of the model will be. Other non-parametric
methods, such as nearest-neighbor, rely directly on the data to make predictions and do
not compute a model that summarizes the data.
The semi-parametric methods tend to have the form of a composition of simple models.
We’ll look at:
• Tree models: partition the input space and use different simple predictions on different
regions of the space; this increases the hypothesis space.
• Additive models: train several different classifiers on the whole space and average the
answers; this decreases the estimation error.
Boosting is a way to construct an additive model that decreases both estimation and struc-
tural error, but we won’t address it in this class.

2 Trees
The idea here is that we would like to find a partition of the input space and then fit very
simple models to predict the output in each piece. The partition is described using a (typi-
cally binary) “decision tree,” which recursively splits the space.
These methods differ by:
• The class of possible ways to split the space at each node; these are generally linear
splits, either aligned with the axes of the space, or more general.


• The class of predictors within the partitions; these are often simply constants, but
may be more general classification or regression models.

• The way in which we control the complexity of the hypothesis: it would be within
the capacity of these methods to have a separate partition for each individual training
example.

• The algorithm for making the partitions and fitting the models.

The primary advantage of tree models is that they are easily interpretable by humans.
This is important in application domains, such as medicine, where there are human ex-
perts who often ultimately make critical decisions and who need to feel confident in their
understanding of recommendations made by an algorithm.
We’ll concentrate on the CART/ID3 family of algorithms, which were invented inde-
pendently in the statistics and the artificial intelligence communities. They work by greed-
ily constructing a partition, where the splits are axis aligned and by fitting a constant model
in the leaves. The interesting questions are how to select the splits and and how to control
capacity. The regression and classification versions are very similar.

2.1 Regression
The predictor is made up of

• A partition function, π, mapping elements of the input space into exactly one of M
regions, R1 , . . . , RM .

• A collection of M output values, Om , one for each region.

If we already knew a division of the space into regions, we would set O_m, the constant output for region R_m, to be the average of the training output values in that region; that is:

O_m = average_{i | x^(i) ∈ R_m} y^(i) .

Define the error in a region as


E_m = Σ_{i | x^(i) ∈ R_m} ( y^(i) − O_m )² .

Ideally, we would select the partition to minimize

λM + Σ_{m=1}^{M} E_m ,

for some regularization constant λ. It is enough to search over all partitions of the training
data (not all partitions of the input space) to optimize this, but the problem is NP-complete.

2.1.1 Building a tree


So, we’ll be greedy. We establish a criterion, given a set of data, for finding the best single
split of that data, and then apply it recursively to partition the space. We will select the
partition of the data that minimizes the sum of the mean squared errors of each partition.
Given a data set D, let

• R_{j,s}^+(D) = {x ∈ D | x_j > s}

• R_{j,s}^−(D) = {x ∈ D | x_j < s}

• ŷ_{j,s}^+ = average_{i | x^(i) ∈ R_{j,s}^+(D)} y^(i)

• ŷ_{j,s}^− = average_{i | x^(i) ∈ R_{j,s}^−(D)} y^(i)

BuildTree(D):

• If |D| ≤ k: return Leaf(D)

• Find the variable j and split point s that minimize:

  E_{R_{j,s}^+(D)} + E_{R_{j,s}^−(D)} .

• Return Node(j, s, BuildTree(R_{j,s}^+(D)), BuildTree(R_{j,s}^−(D)))

Each call to BuildTree considers O(dn) splits (only need to split between each data
point in each dimension); each requires O(n) work.
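A direct, unoptimized sketch of this recursion for regression is below (our own code; points with x_j exactly equal to the split value go to the "left" child here).

```python
import numpy as np

def build_tree(X, y, k):
    """X: n x d array, y: length-n array, k: minimum leaf size."""
    if len(y) <= k:
        return ('leaf', y.mean())
    best = None
    for j in range(X.shape[1]):                   # O(dn) candidate splits
        for s in np.unique(X[:, j]):
            right = X[:, j] > s
            if right.sum() == 0 or (~right).sum() == 0:
                continue
            err = (((y[right] - y[right].mean()) ** 2).sum() +
                   ((y[~right] - y[~right].mean()) ** 2).sum())
            if best is None or err < best[0]:
                best = (err, j, s)
    if best is None:                              # no valid split (e.g., duplicate points)
        return ('leaf', y.mean())
    _, j, s = best
    right = X[:, j] > s
    return ('node', j, s, build_tree(X[right], y[right], k),
                          build_tree(X[~right], y[~right], k))
```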

2.1.2 Pruning
It might be tempting to regularize by stopping for a somewhat large k, or by stopping
when splitting a node does not significantly decrease the error. One problem with short-
sighted stopping criteria is that they might not see the value of a split that is, essentially,
two-dimensional. So, we will tend to build a tree that is much too large, and then prune it
back.
Define the cost complexity of a tree T, where m ranges over its leaves, as

C_α(T) = Σ_{m=1}^{|T|} E_m(T) + α|T| .

For a fixed α, find a T that (approximately) minimizes Cα (T ) by “weakest-link” prun-


ing. Create a sequence of trees by successively removing the bottom-level split that mini-
mizes the increase in overall error, until the root is reached. Return the T in the sequence
that minimizes the criterion.
Pick α using cross validation.

2.2 Classification
The strategy for building and pruning classification trees is very similar to the one for
regression trees.
The output is now the majority of the output values in the leaf:

O_m = majority_{i | x^(i) ∈ R_m} y^(i) .

Define the error in a region as the number of data points that do not have the value Om :

E_m = |{ i | x^(i) ∈ R_m and y^(i) ≠ O_m }| .

Define the empirical probability of an item from class k occurring in region m as:
P̂_mk = P̂(R_m)(k) = |{ i | x^(i) ∈ R_m and y^(i) = k }| / N_m ,
where Nm is the number of training points in region m. We’ll define the empirical proba-
bilities of split values, as well, for later use.

P̂_mjv = P̂(R_mj)(v) = |{ i | x^(i) ∈ R_m and x^(i)_j > v }| / N_m


Splitting criteria Minimize “impurity” in child nodes. Some measures include:


• Misclassification error:

  Q_m(T) = E_m / N_m = 1 − P̂_{m,O_m}

• Gini index:

  Q_m(T) = Σ_k P̂_mk (1 − P̂_mk)

• Entropy:

  Q_m(T) = H(R_m) = − Σ_k P̂_mk log P̂_mk

Choosing the split that minimizes the entropy of the children is equivalent to maximizing the information gain of the test X_j = v, defined by

infoGain(X_j = v, R_m) = H(R_m) − ( P̂_mjv H(R_{j,v}^+) + (1 − P̂_mjv) H(R_{j,v}^−) )
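The three measures and the information gain are easy to compute from the empirical probabilities; here is a sketch (our own helper names, with entropy taken base 2, though any base works).

```python
import numpy as np

def misclassification(p):          # p: array of P_hat_{mk} over classes k
    return 1 - p.max()

def gini(p):
    return (p * (1 - p)).sum()

def entropy(p):
    p = p[p > 0]                   # treat 0 log 0 as 0
    return -(p * np.log2(p)).sum()

def info_gain(parent_p, frac_plus, plus_p, minus_p):
    # H(R_m) - (P_mjv H(R+) + (1 - P_mjv) H(R-)) for a candidate test X_j = v
    return entropy(parent_p) - (frac_plus * entropy(plus_p)
                                + (1 - frac_plus) * entropy(minus_p))
```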

In the two-class case, all the criteria take the value 0.0 when P̂_m0 = 0.0 and when P̂_m0 = 1.0. The respective impurity curves are shown below, where p = P̂_m0:

[Figure: misclassification error, Gini index, and entropy plotted as functions of p = P̂_m0.]

There used to be endless haggling about which one to use. It seems to be traditional to use:
• Entropy to select which node to split while growing the tree
• Misclassification error in the pruning criterion
As a concrete example, consider the following images:


The left image depicts a set of labeled data points and the right shows a partition into
regions by a decision tree.

Points about trees There are many variations on this theme:


• Linear regression or other regression or classification method in each leaf

• Non-axis-parallel splits: e.g., run a perceptron for a while to get a split.


What’s good about trees:
• Easily interpretable

• Easy to handle multi-class classification

• Easy to handle different loss functions (just change predictor in the leaves)
What’s bad about trees:
• High estimation error: small changes in the data can result in very big changes in the
hypothesis.

• Often not the best predictions

Hierarchical mixture of experts Make a “soft” version of trees, in which the splits are
probabilistic (so every point has some degree of membership in every leaf). Can be trained
with a form of gradient descent.

3 Bagging
Bootstrap aggregation is a technique for reducing the estimation error of a non-linear predic-
tor, or one that is adaptive to the data.
• Construct B new data sets of size n by sampling them with replacement from D

• Train a predictor on each one: f̂b

• Regression case: bagged predictor is

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂^b(x)

• Classification case: majority bagged predictor: let f̂^b(x) be a "one-hot" vector with a single 1 and K − 1 zeros, so that ŷ^b(x) = arg max_k f̂^b(x)_k. Then

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂^b(x) ,

which is a vector containing the proportion of classifiers that predicted each class k for input x; and the predicted output is

ŷ_bag(x) = arg max_k f̂_bag(x)_k .

There are theoretical arguments showing that bagging does, in fact, reduce estimation
error.
However, when we bag a model, any simple interpretability is lost.
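A bagging sketch for the regression case is below; `fit` stands for any training procedure that returns a predictor with a predict method (the names are ours, not a library API).

```python
import numpy as np

def bag(X, y, fit, B):
    rng = np.random.default_rng(0)
    n = len(y)
    predictors = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # sample n points with replacement
        predictors.append(fit(X[idx], y[idx]))
    def f_bag(x):
        return np.mean([f.predict(x) for f in predictors], axis=0)
    return f_bag
```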


3.1 Random Forests


Random forests are collections of trees that are constructed to be de-correlated, so that
using them to vote gives maximal advantage.
For b = 1..B

• Draw a bootstrap sample Db of size n from D

• Grow a tree on data Db by recursively repeating these steps:

– Select m variables at random from the d variables


– Pick the best variable and split point among them
– Split the node

• return tree Tb

Given the ensemble of trees, vote to make a prediction on a new x.

4 Nearest Neighbor
In nearest-neighbor models, we don't do any processing of the data at training time; we just remember it! All the work is done at prediction time.
Input values x can be from any domain X (R^d, documents, tree-structured objects, etc.). We just need a distance metric, d : X × X → R^+, which satisfies the following, for all x, x′, x′′ ∈ X:

d(x, x) = 0
d(x, x′) = d(x′, x)
d(x, x′′) ≤ d(x, x′) + d(x′, x′′)

Given a data-set D = {(x^(i), y^(i))}_{i=1}^{n}, our predictor for a new x ∈ X is

h(x) = y^(i) where i = arg min_i d(x, x^(i)) ,

that is, the predicted output associated with the training point that is closest to the query point x.
This same algorithm works for regression and classification! (It's a floor wax and a dessert topping!)
The nearest neighbor prediction function can be described by a Voronoi partition (dividing the space up into regions whose closest point is each individual training point) as shown below:

[Figure: Voronoi partition of the space induced by the training points.]


In each region, we predict the associated y value.


There are several useful variations on this method. In k-nearest-neighbors, we find the
k training points nearest to the query point x and output the majority y value for classifi-
cation or the average for regression. We can also do locally weighted regression in which we
fit locally linear regression models to the k nearest points, possibly giving less weight to
those that are farther away. In large data-sets, it is important to use good data structures
(e.g., ball trees) to perform the nearest-neighbor look-ups efficiently (without looking at all
the data points each time).
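A minimal nearest-neighbor sketch with Euclidean distance is below (our own code; the k-nearest-neighbor and locally weighted variants extend it in the obvious ways).

```python
import numpy as np

class NearestNeighbor:
    def __init__(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)   # "training" just stores D

    def predict(self, x):
        d = np.linalg.norm(self.X - x, axis=1)          # distances to every training point
        return self.y[np.argmin(d)]                     # copy the closest point's label/value
```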
