Deep Learning Tutorial Release 0.1
CONTENTS
LICENSE
Getting Started
3.1 Download
3.2 Datasets
3.3 Notation
3.4 A Primer on Supervised Optimization for Deep Learning
3.5 Theano/Python Tips
Multilayer Perceptron
5.1 The Model
5.2 Going from logistic regression to MLP
5.3 Putting it All Together
5.4 Tips and Tricks for training MLPs
15 Miscellaneous
15.1 Plotting Samples and Filters
16 References
Bibliography
Index
CHAPTER
ONE
LICENSE
CHAPTER
TWO
Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of
moving Machine Learning closer to one of its original goals: Artificial Intelligence. See these course notes
for a brief introduction to Machine Learning for AI and an introduction to Deep Learning algorithms.
Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of
data such as images, sound, and text. For more about deep learning algorithms, see for example:
The monograph or review paper Learning Deep Architectures for AI (Foundations & Trends in Machine Learning, 2009).
The ICML 2009 Workshop on Learning Feature Hierarchies webpage has a list of references.
The LISA public wiki has a reading list and a bibliography.
Geoff Hinton has readings from 2009's NIPS tutorial.
The tutorials presented here will introduce you to some of the most important deep learning algorithms and
will also show you how to run them using Theano. Theano is a Python library that makes writing deep
learning models easy, and gives the option of training them on a GPU.
The algorithm tutorials have some prerequisites. You should know some Python, and be familiar with
NumPy. Since this tutorial is about using Theano, you should read over the Theano basic tutorial first. Once
you've done that, read through our Getting Started chapter; it introduces the notation and [downloadable]
datasets used in the algorithm tutorials, and the way we do optimization by stochastic gradient descent.
The purely supervised learning algorithms are meant to be read in order:
1. Logistic Regression - using Theano for something simple
2. Multilayer perceptron - introduction to layers
3. Deep Convolutional Network - a simplified version of LeNet5
The unsupervised and semi-supervised learning algorithms can be read in any order (the auto-encoders can
be read independently of the RBM/DBN thread):
Auto Encoders, Denoising Autoencoders - description of autoencoders
Stacked Denoising Auto-Encoders - easy steps into unsupervised pre-training for deep nets
Restricted Boltzmann Machines - single layer generative RBM model
Deep Belief Networks - unsupervised generative pre-training of stacked RBMs followed by supervised
fine-tuning
Building towards including the mcRBM model, we have a new tutorial on sampling from energy models:
HMC Sampling - hybrid (aka Hamiltonian) Monte-Carlo sampling with scan()
Building towards including the Contractive auto-encoders tutorial, we have the code for now:
Contractive auto-encoders code - There is some basic doc in the code.
Recurrent neural networks with word embeddings and context window:
Semantic Parsing of Speech using Recurrent Net
LSTM network for sentiment analysis:
LSTM network
Energy-based recurrent neural network (RNN-RBM):
Modeling and generating sequences of polyphonic music
CHAPTER
THREE
GETTING STARTED
These tutorials do not attempt to make up for a graduate or undergraduate course in machine learning, but
we do give a rapid overview of some important concepts (and notation) to make sure that we're on the same
page. You'll also need to download the datasets mentioned in this chapter in order to run the example code
of the upcoming tutorials.
3.1 Download
On each learning algorithm page, you will be able to download the corresponding files. If you want to
download all of them at the same time, you can clone the git repository of the tutorial:
git clone https://github.com/lisa-lab/DeepLearningTutorials.git
3.2 Datasets
3.2.1 MNIST Dataset
(mnist.pkl.gz)
The MNIST dataset consists of handwritten digit images and is divided into 60,000 examples
for the training set and 10,000 examples for testing. In many papers, as well as in this tutorial,
the official training set of 60,000 is divided into an actual training set of 50,000 examples and
10,000 validation examples (for selecting hyper-parameters like learning rate and size of the
model). All digit images have been size-normalized and centered in a fixed-size image of 28 x
28 pixels. In the original dataset each pixel of the image is represented by a value between 0
and 255, where 0 is black, 255 is white and anything in between is a different shade of grey.
Here are some examples of MNIST digits:
For convenience we pickled the dataset to make it easier to use in Python. It is available for
download here. The pickled file represents a tuple of 3 lists: the training set, the validation
set and the testing set. Each of the three lists is a pair formed from a list of images and a list
of class labels for each of the images. An image is represented as a 1-dimensional NumPy array
of 784 (28 x 28) float values between 0 and 1 (0 stands for black, 1 for white). The labels are
numbers between 0 and 9 indicating which digit the image represents. The code block below
shows how to load the dataset.
import cPickle, gzip, numpy
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
When using the dataset, we usually divide it into minibatches (see Stochastic Gradient Descent).
We encourage you to store the dataset in shared variables and access it based on the minibatch
index, given a fixed and known batch size. The reason behind shared variables is related to using
the GPU. There is a large overhead when copying data into the GPU memory. If you copy data
on request (each minibatch individually when needed), as the code will do if you do
not use shared variables, then due to this overhead the GPU code will not be much faster than the
CPU code (maybe even slower). If you have your data in Theano shared variables though, you
give Theano the possibility to copy the entire data onto the GPU in a single call when the shared
variables are constructed. Afterwards the GPU can access any minibatch by taking a slice
from these shared variables, without needing to copy any information from the CPU memory
and therefore bypassing the overhead. Because the datapoints and their labels are usually of
a different nature (labels are usually integers while datapoints are real numbers), we suggest
using different variables for labels and data. We also recommend using different variables for
the training set, validation set and testing set to make the code more readable (resulting in 6
different shared variables).
Since the data is now in one variable, and a minibatch is defined as a slice of that variable, it
is more natural to define a minibatch by indicating its index and its size. In our setup the
batch size stays constant throughout the execution of the code, so a function will actually
require only the index to identify which datapoints to work on. The code below shows how to
store your data and how to access a minibatch:
import numpy
import theano
import theano.tensor as T

def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables

    The reason we store our dataset in shared variables is to allow
    Theano to copy it into the GPU memory (when code is run on GPU).
    Since copying data into the GPU is slow, copying a minibatch every time
    it is needed (the default behaviour if the data is not in a shared
    variable) would lead to a large decrease in performance.
    """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # When storing data on the GPU it has to be stored as floats
    # therefore we will store the labels as floatX as well
    # (shared_y does exactly that). But during our computations
    # we need them as ints (we use labels as index, and if they are
    # floats it doesn't make sense) therefore instead of returning
    # shared_y we will have to cast it to int. This little hack
    # lets us get around this issue
    return shared_x, T.cast(shared_y, 'int32')
The data has to be stored as floats on the GPU (the right dtype for storing on the GPU is given by
theano.config.floatX). To get around this shortcoming for the labels, we store them as floats and
then cast them to int.
Note: If you are running your code on the GPU and the dataset you are using is too large to fit in memory,
the code will crash. In such a case you should not store the whole dataset in a shared variable. You can however store a
sufficiently small chunk of your data (several minibatches) in a shared variable and use that during training.
Once you have gone through the chunk, update the values it stores. This way you minimize the number of data
transfers between CPU memory and GPU memory.
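To make the index-based access concrete, here is a minimal sketch in plain NumPy (not Theano, so no GPU transfer is involved). The helper name get_minibatch and the synthetic arrays are assumptions made for illustration; only the slicing rule index * batch_size : (index + 1) * batch_size comes from the tutorial.

```python
import numpy as np

def get_minibatch(data_x, data_y, index, batch_size):
    """Return the index-th minibatch as a pair of array slices."""
    start = index * batch_size
    stop = (index + 1) * batch_size
    return data_x[start:stop], data_y[start:stop]

# Synthetic stand-in for a dataset: 100 examples with 784 features each.
rng = np.random.RandomState(0)
data_x = rng.rand(100, 784).astype("float32")
data_y = rng.randint(0, 10, size=100).astype("int32")

batch_size = 20
n_batches = data_x.shape[0] // batch_size  # number of full minibatches

# Third minibatch (index 2) covers examples 40..59.
x_batch, y_batch = get_minibatch(data_x, data_y, 2, batch_size)
```

With a fixed batch size, the integer index is the only thing that changes between calls, which is why the compiled Theano functions later in the tutorial take just an index as input.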
3.3 Notation
3.3.1 Dataset notation
We label data sets as $D$. When the distinction is important, we indicate train, validation, and test sets as
$D_{train}$, $D_{valid}$ and $D_{test}$. The validation set is used to perform model selection and hyper-parameter selection, whereas the test set is used to evaluate the final generalization error and compare different algorithms
in an unbiased way.
The tutorials mostly deal with classification problems, where each data set $D$ is an indexed set of pairs
$(x^{(i)}, y^{(i)})$. We use superscripts to distinguish training set examples: $x^{(i)} \in \mathbb{R}^D$ is thus the i-th training
example of dimensionality $D$. Similarly, $y^{(i)} \in \{0, ..., L\}$ is the i-th label assigned to input $x^{(i)}$. It is
straightforward to extend these examples to ones where $y^{(i)}$ has other types (e.g. Gaussian for regression,
or groups of multinomials for predicting multiple symbols).
If $f: \mathbb{R}^D \rightarrow \{0, ..., L\}$ is the prediction function, then this loss can be written as:
$$\ell_{0,1} = \sum_{i=0}^{|D|} I_{f(x^{(i)}) \neq y^{(i)}}$$
where either $D$ is the training set (during training) or $D \cap D_{train} = \emptyset$ (to avoid biasing the evaluation of
validation or test error). $I$ is the indicator function defined as:
$$I_x = \begin{cases} 1 & \text{if } x \text{ is True} \\ 0 & \text{otherwise} \end{cases}$$
In this tutorial, $f$ is defined as:
$$f(x) = \mathrm{argmax}_k P(Y = k | x, \theta)$$
In Python, using Theano, this can be written as:
# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero one loss ; to get the actual value this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
# (axis=1 takes the argmax over classes for each row of a minibatch)
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))
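The log-likelihood of the correct classes over the data set is:
$$\mathcal{L}(\theta, D) = \sum_{i=0}^{|D|} \log P(Y = y^{(i)} | x^{(i)}, \theta)$$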
The likelihood of the correct class is not the same as the number of right predictions, but from the point of
view of a randomly initialized classifier they are pretty similar. Remember that likelihood and zero-one loss
are different objectives; you should see that they are correlated on the validation set, but sometimes one will
rise while the other falls, or vice-versa.
Since we usually speak in terms of minimizing a loss function, learning will thus attempt to minimize the
negative log-likelihood (NLL), defined as:
$$NLL(\theta, D) = -\sum_{i=0}^{|D|} \log P(Y = y^{(i)} | x^{(i)}, \theta)$$
The NLL of our classifier is a differentiable surrogate for the zero-one loss, and we use the gradient of this
function over our training data as a supervised learning signal for deep learning of a classifier.
This can be computed using the following line of code :
# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
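The same two quantities can be checked numerically without compiling a Theano function. The following is a small NumPy sketch; the probability matrix and labels are made-up values chosen for illustration, and NumPy's paired integer indexing plays the role of the Theano indexing described in the comment above.

```python
import numpy as np

# Hypothetical class probabilities for 3 examples and 4 classes (rows sum to 1).
p_y_given_x = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])
y = np.array([0, 2, 3])  # correct labels

# Zero-one loss: count the examples whose most probable class is not the label.
y_pred = np.argmax(p_y_given_x, axis=1)
zero_one_loss = int(np.sum(y_pred != y))

# Negative log-likelihood: pick log P(Y = y_i | x_i) for each row i, then sum.
nll = -np.sum(np.log(p_y_given_x)[np.arange(y.shape[0]), y])
```

Here only the second example is misclassified, so the zero-one loss is 1, while the NLL still penalizes the low probability (0.2) assigned to that example's correct class.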
Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but
proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire
training set. In its purest form, we estimate the gradient from just a single example at a time.
# STOCHASTIC GRADIENT DESCENT
for (x_i, y_i) in training_set:
                            # imagine an infinite generator
                            # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-called minibatches. Minibatch SGD works identically to SGD, except that we use more than one training
example to make each estimate of the gradient. This technique reduces variance in the estimate of the
gradient, and often makes better use of the hierarchical memory organization in modern computers.
for (x_batch, y_batch) in train_batches:
                            # imagine an infinite generator
                            # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ... # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
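As a runnable counterpart to the pseudocode, here is a minimal sketch of minibatch SGD in plain NumPy. It assumes a softmax classifier trained with the mean NLL on synthetic two-class data; the helper names (softmax, mean_nll_and_grads), the toy data, and the hyper-parameter values are illustrative assumptions, and the gradient is written out by hand rather than obtained from Theano.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_nll_and_grads(W, b, x, y):
    """Mean negative log-likelihood of a softmax classifier and its gradients."""
    n = x.shape[0]
    p = softmax(x @ W + b)
    loss = -np.mean(np.log(p[np.arange(n), y]))
    d_scores = p.copy()
    d_scores[np.arange(n), y] -= 1.0   # d(loss)/d(scores) = (p - onehot(y)) / n
    d_scores /= n
    return loss, x.T @ d_scores, d_scores.sum(axis=0)

# Synthetic, linearly separable toy data (hypothetical, not MNIST).
rng = np.random.RandomState(0)
x_all = rng.randn(200, 5)
y_all = (x_all[:, 0] > 0).astype(int)

W = np.zeros((5, 2))
b = np.zeros(2)
learning_rate, batch_size, n_epochs = 0.1, 20, 10

initial_loss, _, _ = mean_nll_and_grads(W, b, x_all, y_all)
for epoch in range(n_epochs):
    for start in range(0, x_all.shape[0], batch_size):
        x_batch = x_all[start:start + batch_size]
        y_batch = y_all[start:start + batch_size]
        _, g_W, g_b = mean_nll_and_grads(W, b, x_batch, y_batch)
        W -= learning_rate * g_W   # one gradient step per minibatch
        b -= learning_rate * g_b
final_loss, _, _ = mean_nll_and_grads(W, b, x_all, y_all)
```

Each pass over the data performs 10 parameter updates here (200 examples in batches of 20), which is the sense in which the batch size controls the number of updates per epoch.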
There is a tradeoff in the choice of the minibatch size B. The reduction of variance and use of SIMD
instructions helps most when increasing B from 1 to 2, but the marginal improvement fades rapidly to
nothing. With large B, time is wasted in reducing the variance of the gradient estimator; that time would be
better spent on additional gradient steps. An optimal B is model-, dataset-, and hardware-dependent, and
can be anywhere from 1 to maybe several hundred. In the tutorial we set it to 20, but this choice is almost
arbitrary (though harmless).
Note: If you are training for a fixed number of epochs, the minibatch size becomes important because it
controls the number of updates done to your parameters. Training the same model for 10 epochs using a
batch size of 1 yields completely different results compared to training for the same 10 epochs but with a
batch size of 20. Keep this in mind when switching between batch sizes and be prepared to tweak all the
other parameters according to the batch size used.
All code blocks above show pseudocode of what the algorithm looks like. Such an algorithm can be
implemented in Theano as follows:
# Minibatch Stochastic Gradient Descent
# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;
# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)
# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)
for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
3.4.3 Regularization
There is more to machine learning than optimization. When we train our model from data we are trying
to prepare it to do well on new examples, not the ones it has already seen. The training loop above for
MSGD does not take this into account, and may overfit the training examples. A way to combat overfitting
is through regularization. There are several techniques for regularization; the ones we will explain here are
L1/L2 regularization and early-stopping.
L1 and L2 regularization
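L1 and L2 regularization involve adding an extra term to the loss function, which penalizes certain parameter
configurations. Formally, if our loss function is:
$$NLL(\theta, D) = -\sum_{i=0}^{|D|} \log P(Y = y^{(i)} | x^{(i)}, \theta)$$
then the regularized loss will be:
$$E(\theta, D) = NLL(\theta, D) + \lambda R(\theta)$$
or, in our case:
$$E(\theta, D) = NLL(\theta, D) + \lambda ||\theta||_p^p$$
where
$$||\theta||_p = \left( \sum_{j=0}^{|\theta|} |\theta_j|^p \right)^{\frac{1}{p}}$$
which is the $L_p$ norm of $\theta$. $\lambda$ is a hyper-parameter which controls the relative importance of the regularization
term. Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p=2, then the
regularizer is also called weight decay.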
In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural
network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that
the network models). More intuitively, the two terms (NLL and $R(\theta)$) correspond to modelling the data
well (NLL) and having simple or smooth solutions ($R(\theta)$). Thus, minimizing the sum of both will, in
theory, correspond to finding the right trade-off between the fit to the training data and the generality of
the solution that is found. To follow Occam's razor principle, this minimization should find us the simplest
solution (as measured by our simplicity criterion) that fits the training data.
Note that the fact that a solution is simple does not mean that it will generalize well. Empirically, it
was found that performing such regularization in the context of neural networks helps with generalization,
especially on small datasets. The code block below shows how to compute the loss in Python when it
contains both an L1 regularization term weighted by $\lambda_1$ and an L2 regularization term weighted by $\lambda_2$:
# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))
# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)
# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
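For a quick numeric check of the regularized loss, here is a NumPy sketch. The function name regularized_loss, the example weight matrix, and the values of nll, lambda_1 and lambda_2 are all assumptions picked for illustration; passing only the weight matrix (and not a bias) is likewise a choice made here, not something the text above prescribes.

```python
import numpy as np

def regularized_loss(nll, params, lambda_1, lambda_2):
    """Add L1 and squared-L2 penalties over all arrays in params to an NLL value."""
    l1 = sum(np.sum(np.abs(p)) for p in params)
    l2_sqr = sum(np.sum(p ** 2) for p in params)
    return nll + lambda_1 * l1 + lambda_2 * l2_sqr

W = np.array([[1.0, -2.0], [0.5, 0.0]])
# L1 term: 1 + 2 + 0.5 + 0 = 3.5 ; squared L2 term: 1 + 4 + 0.25 + 0 = 5.25
loss = regularized_loss(nll=2.0, params=[W], lambda_1=0.01, lambda_2=0.001)
```

The result is 2.0 + 0.01 * 3.5 + 0.001 * 5.25 = 2.04025, so with these small coefficients the penalty shifts the loss only slightly while still producing a gradient that pulls the weights toward zero.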
Early-Stopping
Early-stopping combats overfitting by monitoring the model's performance on a validation set. A validation
set is a set of examples that we never use for gradient descent, but which is also not a part of the test set. The
validation examples are considered to be representative of future test examples. We can use them during
training because they are not part of the test set. If the model's performance ceases to improve sufficiently
on the validation set, or even degrades with further optimization, then the heuristic implemented here gives
up on much further optimization.
The choice of when to stop is a judgement call and a few heuristics exist, but these tutorials will make use
of a strategy based on a geometrically increasing amount of patience.
# early-stopping parameters
patience = 5000  # look at this many examples regardless
patience_increase = 2  # wait this much longer when a new best is
                       # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience/2)
                              # go through this many
                              # minibatches before checking the network
                              # on the validation set; in this case we
                              # check every epoch
best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()
done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):
        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent
        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do iter % validation_frequency it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:
            this_validation_loss = ... # compute zero-one loss on validation set
            if this_validation_loss < best_validation_loss:
                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)
                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss
        if patience <= iter:
            done_looping = True
            break
# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization
If we run out of batches of training data before running out of patience, then we just go back to the beginning
of the training set and repeat.
Note: The validation_frequency should always be smaller than the patience. The code should
check at least two times how it performs before running out of patience. This is the reason we used the
formulation validation_frequency = min(value, patience/2.).
Note: This algorithm could possibly be improved by using a test of statistical significance rather than the
simple comparison, when deciding whether to increase the patience.
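The patience rule can be traced on a small example. The sketch below replays a made-up sequence of validation losses through the same heuristic; it is a simplification that validates after every iteration (that is, it assumes validation_frequency = 1) and starts from a much smaller patience than the tutorial's 5000. The function name early_stopping_run and the loss values are hypothetical.

```python
def early_stopping_run(validation_losses, patience=4, patience_increase=2,
                       improvement_threshold=0.995):
    """Replay a sequence of validation losses through the patience heuristic.

    validation_losses[i] is the (hypothetical) validation loss measured after
    iteration i. Returns (best_iter, best_loss, last_iter_run).
    """
    best_loss = float("inf")
    best_iter = -1
    last_iter = -1
    for it, loss in enumerate(validation_losses):
        last_iter = it
        if loss < best_loss:
            # extend patience only for a sufficiently large relative improvement
            if loss < best_loss * improvement_threshold:
                patience = max(patience, it * patience_increase)
            best_loss, best_iter = loss, it
        if patience <= it:
            break  # out of patience: stop training
    return best_iter, best_loss, last_iter

losses = [0.50, 0.40, 0.30, 0.25, 0.26, 0.27, 0.28, 0.29, 0.30, 0.31]
best_iter, best_loss, last_iter = early_stopping_run(losses)
```

In this trace the best loss (0.25) is found at iteration 3, which raises the patience to 6; no later iteration improves on it, so the loop stops at iteration 6 without looking at the remaining values.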
3.4.4 Testing
After the loop exits, the best_params variable refers to the best-performing model on the validation set. If
we repeat this procedure for another model class, or even another random initialization, we should use the
same train/valid/test split of the data, and get other best-performing models. If we have to choose what the
best model class or the best initialization was, we compare the best_validation_loss for each model. When
we have finally chosen the model we think is the best (on validation data), we report that model's test set
performance. That is the performance we expect on unseen examples.
3.4.5 Recap
That's it for the optimization section. The technique of early-stopping requires us to partition the set of
examples into three sets (training Dtrain , validation Dvalid , test Dtest ). The training set is used for minibatch
stochastic gradient descent on the differentiable approximation of the objective function. As we perform
this gradient descent, we periodically consult the validation set to see how our model is doing on the real
objective function (or at least our empirical estimate of it). When we see a good model on the validation set,
we save it. When it has been a long time since seeing a good model, we abandon our search and return the
best parameters found, for evaluation on the test set.
The best way to save/archive your model's parameters is to use pickle or deepcopy the ndarray objects. So
for example, if your parameters are in shared variables w, v, u, then your save command should look
something like:
>>> import cPickle
>>> save_file = open('path', 'wb')  # this will overwrite current contents
>>> cPickle.dump(w.get_value(borrow=True), save_file, -1)  # -1 selects the highest pickle protocol
>>> cPickle.dump(v.get_value(borrow=True), save_file, -1)
>>> cPickle.dump(u.get_value(borrow=True), save_file, -1)
>>> save_file.close()
Then later, you can load your data back like this:
>>> save_file = open('path', 'rb')
>>> w.set_value(cPickle.load(save_file), borrow=True)
>>> v.set_value(cPickle.load(save_file), borrow=True)
>>> u.set_value(cPickle.load(save_file), borrow=True)
This technique is a bit verbose, but it is tried and true. You will be able to load your data and render it in
matplotlib without trouble, years after saving it.
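The save and load steps can be tried end to end without Theano. This sketch uses Python 3's pickle module (rather than cPickle) and plain NumPy arrays as stand-ins for the values returned by get_value(); the array contents and the temporary file name params.pkl are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical parameter arrays standing in for w.get_value(), v.get_value(), ...
w = np.arange(6, dtype="float32").reshape(2, 3)
v = np.ones(3, dtype="float32")

path = os.path.join(tempfile.mkdtemp(), "params.pkl")

# Save: dump the arrays one after another into the same file.
with open(path, "wb") as save_file:
    pickle.dump(w, save_file, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.dump(v, save_file, protocol=pickle.HIGHEST_PROTOCOL)

# Load: read them back in the same order they were written.
with open(path, "rb") as save_file:
    w_loaded = pickle.load(save_file)
    v_loaded = pickle.load(save_file)
```

Because several objects are written to one file, the load order must match the save order; nothing in the file records which array is which.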
Do not pickle your training or test functions for long-term storage
Theano functions are compatible with Python's deepcopy and pickle mechanisms, but you should not necessarily pickle a Theano function. If you update your Theano folder and one of the internals changes, then
you may not be able to un-pickle your model. Theano is still in active development, and the internal APIs
are subject to change. So to be on the safe side, do not pickle your entire training or testing functions for
long-term storage. The pickle mechanism is aimed at short-term storage, such as a temp file, or a copy
to another machine in a distributed job.
Read more about serialization in Theano, or Python's pickling.
CHAPTER
FOUR
Note: This section assumes familiarity with the following Theano concepts: shared variables, basic
arithmetic ops, T.grad, floatX. If you intend to run the code on GPU also read GPU.
Note: The code for this section is available for download here.
In this section, we show how Theano can be used to implement the most basic classifier: logistic regression. We start off with a quick primer of the model, which serves both as a refresher and as a way to anchor the
notation and show how mathematical expressions are mapped onto Theano graphs.
In the deepest of machine learning traditions, this tutorial will tackle the exciting problem of MNIST digit
classification.
The model's prediction $y_{pred}$ is the class whose probability is maximal, specifically:
$$y_{pred} = \mathrm{argmax}_i P(Y = i | x, W, b)$$
Since the parameters of the model must maintain a persistent state throughout training, we allocate shared
variables for W, b. This declares them both as being symbolic Theano variables, but also initializes their
contents. The dot and softmax operators are then used to compute the vector P (Y |x, W, b). The result
p_y_given_x is a symbolic variable of vector-type.
To get the actual model prediction, we can use the T.argmax operator, which will return the index at which
p_y_given_x is maximal (i.e. the class with maximum probability).
Now of course, the model we have defined so far does not do anything useful yet, since its parameters are
still in their initial state. The following section will thus cover how to learn the optimal parameters.
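The forward computation described above (dot product, softmax, argmax) can be sketched in NumPy as follows. The softmax helper, the zero initialization of W and b, and the random inputs are assumptions made for this illustration; the Theano version builds a symbolic graph, whereas this version computes values directly.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

n_in, n_out = 784, 10
W = np.zeros((n_in, n_out))   # weights initialised to zero (an assumption here)
b = np.zeros(n_out)

x = np.random.RandomState(0).rand(5, n_in)  # 5 synthetic inputs, one per row
p_y_given_x = softmax(x @ W + b)            # one row of class probabilities per input
y_pred = np.argmax(p_y_given_x, axis=1)     # index of the most probable class
```

With all parameters at zero, every class gets probability 0.1 and argmax returns class 0 for every input, which illustrates why the untrained model is not useful yet.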
Note: For a complete list of Theano ops, see: list of ops
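Minimizing the negative log-likelihood loss is equivalent to maximizing the likelihood of the data set $D$ under the model parameterized by $\theta$. Let us first start by defining the
likelihood $\mathcal{L}$ and loss $\ell$:
$$\mathcal{L}(\theta = \{W, b\}, D) = \sum_{i=0}^{|D|} \log P(Y = y^{(i)} | x^{(i)}, W, b)$$
$$\ell(\theta = \{W, b\}, D) = -\mathcal{L}(\theta = \{W, b\}, D)$$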
Note: Even though the loss is formally defined as the sum, over the data set, of individual error terms,
in practice, we use the mean (T.mean) in the code. This allows for the learning rate choice to be less
dependent on the minibatch size.
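A short numeric sketch of why the mean helps: with a summed loss the gradient scales with the number of examples in the minibatch, while with a mean loss it does not. The per-example gradients below are random made-up values, not gradients of a real model.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical per-example gradients for a 3-parameter model.
per_example_grads = rng.randn(100, 3) + 1.0

small, large = per_example_grads[:10], per_example_grads[:100]

# With a summed loss the gradient norm grows with the batch size...
sum_ratio = np.linalg.norm(large.sum(axis=0)) / np.linalg.norm(small.sum(axis=0))
# ...while with a mean loss the batch-size factor is divided out.
mean_ratio = np.linalg.norm(large.mean(axis=0)) / np.linalg.norm(small.mean(axis=0))
```

Going from 10 to 100 examples, sum_ratio is exactly 10 times mean_ratio, so a learning rate tuned for the summed loss at one batch size would need rescaling at another.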
We start by allocating symbolic variables for the training inputs x and their corresponding classes y. Note
that x and y are defined outside the scope of the LogisticRegression object. Since the class requires
the input to build its graph, it is passed as a parameter of the __init__ function. This is useful in case you
want to connect instances of such classes to form a deep network. The output of one layer can be passed as
the input of the layer above. (This tutorial does not build a multi-layer network, but this code will be reused
in future tutorials that do.)
Finally, we define a (symbolic) cost variable to minimize, computed by
classifier.negative_log_likelihood.
Note that x is an implicit symbolic input to the definition of cost, because the symbolic variables of
classifier were defined in terms of x at initialization.
g_W and g_b are symbolic variables, which can be used as part of a computation graph. The function
train_model, which performs one step of gradient descent, can then be defined as follows:
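# specify how to update the parameters of the model as a list of
# (variable, update expression) pairs.
updates = [(classifier.W, classifier.W - learning_rate * g_W),
           (classifier.b, classifier.b - learning_rate * g_b)]
# compile a function that returns the cost and applies `updates`
train_model = theano.function(
    inputs=[index], outputs=cost, updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]})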
updates is a list of pairs. In each pair, the first element is the symbolic variable to be updated in the
step, and the second element is the symbolic function for calculating its new value. Similarly, givens is a
dictionary whose keys are symbolic variables and whose values specify their replacements during the step.
The function train_model is then defined such that:
the input is the mini-batch index index that, together with the batch size (which is not an input since
it is fixed) defines x with corresponding labels y
the return value is the cost/loss associated with the x, y defined by the index
on every function call, it will first replace x and y with the slices from the training set specified by
index. Then, it will evaluate the cost associated with that minibatch and apply the operations defined
by the updates list.
Each time train_model(index) is called, it will thus compute and return the cost of a minibatch,
while also performing a step of MSGD. The entire learning algorithm thus consists in looping over all
examples in the dataset, considering all the examples in one minibatch at a time, and repeatedly calling the
train_model function.
    def errors(self, y):
        """Return the mean zero-one error over the examples of the minibatch."""
        # check if y has same dimension of y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()
We then create a function test_model and a function validate_model, which we can call to retrieve
this value. As you will see shortly, validate_model is key to our early-stopping implementation (see
Early-Stopping). These functions take a minibatch index and compute, for the examples in that minibatch,
the fraction that were misclassified by the model. The only difference between them is that test_model
draws its minibatches from the testing set, while validate_model draws them from the validation set.
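As a small illustration of what these two functions return, here is a NumPy sketch of the zero-one loss on a minibatch. The function name errors mirrors the method above, but the labels and predictions are made-up values, and plain arrays stand in for Theano variables.

```python
import numpy as np

def errors(y_pred, y):
    """Fraction of misclassified examples in one minibatch."""
    y_pred, y = np.asarray(y_pred), np.asarray(y)
    if y.ndim != y_pred.ndim:
        raise TypeError('y should have the same shape as y_pred')
    if not np.issubdtype(y.dtype, np.integer):
        raise NotImplementedError()
    # not_equal yields True for every mistake; the mean is the error rate
    return np.mean(np.not_equal(y_pred, y))

# two hypothetical minibatches of four examples each
batch_errors = [
    errors([1, 0, 2, 2], [1, 1, 2, 0]),
    errors([0, 0, 1, 1], [0, 0, 1, 1]),
]
# averaging over equally sized minibatches gives the error on the whole set
dataset_error = np.mean(batch_errors)
```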
    # compiling a Theano function that computes the mistakes that are made by
    # the model on a minibatch
    test_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size: (index + 1) * batch_size],
            y: test_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    validate_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
The prediction of the model is then obtained by taking the argmax of
the vector whose i-th element is P(Y=i|x).

.. math::

  y_{pred} = argmax_i P(Y=i|x,W,b)

References:
    - textbooks: "Pattern Recognition and Machine Learning" Christopher M. Bishop, section 4.3.2
"""
__docformat__ = 'restructedtext en'

import cPickle
import gzip
import os
import sys
import timeit
import numpy
import theano
import theano.tensor as T
class LogisticRegression(object):
    """Multi-class Logistic Regression Class

    The logistic regression is fully described by a weight matrix :math:`W`
    and bias vector :math:`b`. Classification is done by projecting data
    points onto a set of hyperplanes, the distance to which is used to
    determine a class membership probability.
    """

    def __init__(self, input, n_in, n_out):
def load_data(dataset):
    ''' Loads the dataset

    :type dataset: string
    :param dataset: the path to the dataset (here MNIST)
    '''

    #############
    # LOAD DATA #
    #############

    # Download the MNIST dataset if it is not present
    data_dir, data_file = os.path.split(dataset)
    if data_dir == "" and not os.path.isfile(dataset):
        # Check if dataset is in the data directory.
        new_path = os.path.join(
            os.path.split(__file__)[0],
            "..",
            "data",
            dataset
        )
        if os.path.isfile(new_path) or data_file == 'mnist.pkl.gz':
            dataset = new_path

    if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz':
        import urllib
        origin = (
            'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
        )
        print 'Downloading data from %s' % origin
        urllib.urlretrieve(origin, dataset)

    print '... loading data'

    # Load the dataset
    f = gzip.open(dataset, 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()
    # train_set, valid_set, test_set format: tuple(input, target)
    # input is a numpy.ndarray of 2 dimensions (a matrix)
    # in which each row corresponds to an example. target is a
    # numpy.ndarray of 1 dimension (a vector) that has the same length as
    # the number of rows in the input. It gives the target
    # for the example with the same index in the input.

    def shared_dataset(data_xy, borrow=True):
"""
    datasets = load_data(dataset)

    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]

    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print '... building the model'

    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch

    x = T.matrix('x')  # the data is presented as rasterized images
    y = T.ivector('y')  # the labels are presented as 1D vector of
                        # [int] labels
batch_size],
batch_size]
    ###############
    # TRAIN MODEL #
    ###############
    print '... training the model'
    # early-stopping parameters
    patience = 5000  # look at this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is
                           # found
    improvement_threshold = 0.995  # a relative improvement of this much is
                                   # considered significant
    validation_frequency = min(n_train_batches, patience / 2)
                                  # go through this many
                                  # minibatches before checking the network
                                  # on the validation set; in this case we
                                  # check every epoch

    best_validation_loss = numpy.inf
    test_score = 0.
    start_time = timeit.default_timer()

    done_looping = False
    epoch = 0
    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in xrange(n_train_batches):

            minibatch_avg_cost = train_model(minibatch_index)
            # iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index
            if (iter + 1) % validation_frequency == 0:
                # compute zero-one loss on validation set
                validation_losses = [validate_model(i)
                                     for i in xrange(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)

                print(
                    'epoch %i, minibatch %i/%i, validation error %f %%' %
                    (
                        epoch,
                        minibatch_index + 1,
                        n_train_batches,
                        this_validation_loss * 100.
                    )
                )

                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:
                    # improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss * \
                       improvement_threshold:
                        patience = max(patience, iter * patience_increase)

                    best_validation_loss = this_validation_loss
                    # test it on the test set

                    test_losses = [test_model(i)
                                   for i in xrange(n_test_batches)]
                    test_score = numpy.mean(test_losses)

                    print(
                        (
        (
            'Optimization complete with best validation score of %f %%,'
            'with test performance %f %%'
        )
        % (best_validation_loss * 100., test_score * 100.)
    )
    print 'The code run for %d epochs, with %f epochs/sec' % (
        epoch, 1. * epoch / (end_time - start_time))
    print >> sys.stderr, ('The code for file ' +
                          os.path.split(__file__)[1] +
                          ' ran for %.1fs' % ((end_time - start_time)))
def predict():
    """
    An example of how to load a trained model and use it
    to predict labels.
    """

    # load the saved model
    classifier = cPickle.load(open('best_model.pkl'))

    # compile a predictor function
    predict_model = theano.function(
        inputs=[classifier.input],
        outputs=classifier.y_pred)

    # We can test it on some examples from the test set
    dataset = 'mnist.pkl.gz'
    datasets = load_data(dataset)
    test_set_x, test_set_y = datasets[2]
    test_set_x = test_set_x.get_value()

    predicted_values = predict_model(test_set_x[:10])
    print ("Predicted values for the first 10 examples in test set:")
    print predicted_values


if __name__ == '__main__':
    sgd_optimization_mnist()
The user can learn to classify MNIST digits with SGD logistic regression, by typing, from within the
DeepLearningTutorials folder:
python code/logistic_sgd.py
...
epoch 72, minibatch 83/83, validation error 7.510417 %
epoch 72, minibatch 83/83, test error of best model 7.510417 %
epoch 73, minibatch 83/83, validation error 7.500000 %
epoch 73, minibatch 83/83, test error of best model 7.489583 %
Optimization complete with best validation score of 7.500000 %,with test performance 7.4895
On an Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00 GHz the code runs at approximately 1.936 epochs/sec
and it took 75 epochs to reach a test error of 7.489%. On the GPU the code runs at almost 10.0 epochs/sec.
For this instance we used a batch size of 600.
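The patience-based early stopping used in the training loop above can be isolated in a few lines of plain Python. This is only a sketch under simplifying assumptions: the function early_stopping is hypothetical, the model is validated at every iteration, and the stopping test (stop once patience is less than or equal to the iteration number) is assumed rather than taken from the listing.

```python
def early_stopping(validation_losses, patience=4, patience_increase=2,
                   improvement_threshold=0.995):
    """Return (best_loss, iterations_run) for a sequence of validation losses."""
    best_loss = float('inf')
    for it, loss in enumerate(validation_losses):
        if loss < best_loss:
            # extend patience only when the improvement is significant
            if loss < best_loss * improvement_threshold:
                patience = max(patience, it * patience_increase)
            best_loss = loss
        if patience <= it:
            # assumed stopping rule: we ran out of patience
            return best_loss, it + 1
    return best_loss, len(validation_losses)

# the loss plateaus at 0.3, so training stops once patience is exhausted
best, n_iter = early_stopping([0.5, 0.4, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3])
```

With a steadily improving loss, patience keeps growing with the iteration count and the loop runs through all the losses it is given.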
CHAPTER
FIVE
MULTILAYER PERCEPTRON
Note: This section assumes the reader has already read through Classifying MNIST digits using Logistic
Regression. Additionally, it uses the following new Theano functions and concepts: T.tanh, shared variables,
basic arithmetic ops, T.grad, L1 and L2 regularization, floatX. If you intend to run the code on GPU also
read GPU.
Note: The code for this section is available for download here.
The next architecture we are going to present using Theano is the single-hidden-layer Multi-Layer
Perceptron (MLP). An MLP can be viewed as a logistic regression classifier where the input is first transformed
using a learnt non-linear transformation $\Phi$. This transformation projects the input data into a space where it
becomes linearly separable. This intermediate layer is referred to as a hidden layer. A single hidden layer is
sufficient to make MLPs a universal approximator. However, we will see later on that there are substantial
benefits to using many such hidden layers, which is the very premise of deep learning. See these course notes
for an introduction to MLPs, the back-propagation algorithm, and how to train MLPs.
This tutorial will again tackle the problem of MNIST digit classification.
Formally, a one-hidden-layer MLP is a function $f: R^D \rightarrow R^L$, where $D$ is the size of input vector $x$
and $L$ is the size of the output vector $f(x)$, such that, in matrix notation:
$f(x) = G(b^{(2)} + W^{(2)}(s(b^{(1)} + W^{(1)} x)))$,
with bias vectors $b^{(1)}$, $b^{(2)}$; weight matrices $W^{(1)}$, $W^{(2)}$ and activation functions $G$ and $s$.
The vector $h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$ constitutes the hidden layer. $W^{(1)} \in R^{D \times D_h}$ is the
weight matrix connecting the input vector to the hidden layer. Each column $W^{(1)}_{\cdot i}$ represents the weights
from the input units to the i-th hidden unit. Typical choices for $s$ include tanh, with
$tanh(a) = (e^a - e^{-a})/(e^a + e^{-a})$, or the logistic sigmoid function, with $sigmoid(a) = 1/(1 + e^{-a})$.
We will be using tanh in this tutorial because it typically yields faster training (and sometimes also better
local minima). Both the tanh and sigmoid are scalar-to-scalar functions but their natural extension to vectors
and tensors consists in applying them element-wise (i.e. separately on each element of the vector, yielding a
same-size vector).
The output vector is then obtained as: $o(x) = G(b^{(2)} + W^{(2)} h(x))$. The reader should recognize the form
we already used for Classifying MNIST digits using Logistic Regression. As before, class-membership
probabilities can be obtained by choosing $G$ as the softmax function (in the case of multi-class classification).
To train an MLP, we learn all parameters of the model, and here we use Stochastic Gradient Descent with
minibatches. The set of parameters to learn is the set $\theta = \{W^{(2)}, b^{(2)}, W^{(1)}, b^{(1)}\}$. Obtaining the
gradients $\partial \ell / \partial \theta$ can be achieved through the backpropagation algorithm (a special case of
the chain-rule of derivation). Thankfully, since Theano performs automatic differentiation, we will not need to
cover this in the tutorial!
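To make the formula concrete, here is a minimal NumPy sketch of the forward pass with s = tanh and G = softmax. It is an illustration, not the tutorial's Theano code: the sizes and random weights are made up, and it uses the row-vector convention of the code listings (x times W), so that $W^{(1)}$ has shape (D, D_h).

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))   # shift by the max for numerical stability
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    h = np.tanh(x.dot(W1) + b1)        # hidden layer h(x), with s = tanh
    return softmax(h.dot(W2) + b2)     # output o(x), with G = softmax

rng = np.random.RandomState(0)
D, D_h, L = 5, 3, 4                    # hypothetical layer sizes
x = rng.rand(D)
probs = mlp_forward(x, rng.randn(D, D_h), np.zeros(D_h),
                    rng.randn(D_h, L), np.zeros(L))
y_pred = int(np.argmax(probs))         # predicted class
```

The output is a vector of L class-membership probabilities, and the prediction is its argmax, as for logistic regression.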
The initial values for the weights of a hidden layer i should be uniformly sampled from a symmetric interval
that depends on the activation function. For the tanh activation function, results obtained in [Xavier10] show
that the interval should be $[-\sqrt{6/(fan_{in}+fan_{out})}, \sqrt{6/(fan_{in}+fan_{out})}]$, where $fan_{in}$ is
the number of units in the (i-1)-th layer, and $fan_{out}$ is the number of units in the i-th layer. For the
sigmoid function the interval is $[-4\sqrt{6/(fan_{in}+fan_{out})}, 4\sqrt{6/(fan_{in}+fan_{out})}]$. This
initialization ensures that, early in training, each neuron
operates in a regime of its activation function where information can easily be propagated both upward
(activations flowing from inputs to outputs) and backward (gradients flowing from outputs to inputs).
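The two intervals translate directly into a sampling routine. The sketch below uses plain NumPy; the helper name init_weights and the string argument used to select the activation are assumptions made for illustration.

```python
import numpy as np

def init_weights(rng, fan_in, fan_out, activation='tanh'):
    """Sample a (fan_in, fan_out) weight matrix from the symmetric interval."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    if activation == 'sigmoid':
        bound *= 4                      # sigmoid uses a 4 times wider interval
    return rng.uniform(low=-bound, high=bound, size=(fan_in, fan_out))

rng = np.random.RandomState(1234)
W_tanh = init_weights(rng, 784, 500)
W_sigmoid = init_weights(rng, 784, 500, activation='sigmoid')
```

For 784 inputs and 500 hidden units the tanh bound is sqrt(6 / 1284), roughly 0.068, so all initial weights are small and centred on zero.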
Note that we used a given non-linear function as the activation function of the hidden layer. By default this
is tanh, but in many cases we might want to use something else.
In terms of the theory above, this class implements the graph that computes the hidden layer value
$h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$. If you give this graph as input to the LogisticRegression class,
implemented in the previous tutorial Classifying MNIST digits using Logistic Regression, you get the output
of the MLP. You can see this in the following short implementation of the MLP class.
class MLP(object):
    """Multi-Layer Perceptron Class

    A multilayer perceptron is a feedforward artificial neural network model
    that has one layer or more of hidden units and nonlinear activations.
    Intermediate layers usually have as activation function tanh or the
    sigmoid function (defined here by a ``HiddenLayer`` class) while the
    top layer is a softmax layer (defined here by a ``LogisticRegression``
    class).
    """

    def __init__(self, rng, input, n_in, n_hidden, n_out):
        """Initialize the parameters for the multilayer perceptron

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
        architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
        which the datapoints lie

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
        which the labels lie

        """

        # Since we are dealing with a one hidden layer MLP, this will translate
        # into a HiddenLayer with a tanh activation function connected to the
        # LogisticRegression layer; the activation function can be replaced by
        # sigmoid or any other nonlinear function
        self.hiddenLayer = HiddenLayer(
            rng=rng,
            input=input,
            n_in=n_in,
            n_out=n_hidden,
            activation=T.tanh
        )

        # The logistic regression layer gets as input the hidden units
        # of the hidden layer
        self.logRegressionLayer = LogisticRegression(
            input=self.hiddenLayer.output,
            n_in=n_hidden,
            n_out=n_out
        )
In this tutorial we will also use L1 and L2 regularization (see L1 and L2 regularization). For this, we need
to compute the L1 norm and the squared L2 norm of the weights $W^{(1)}$, $W^{(2)}$.
        # L1 norm ; one regularization option is to enforce L1 norm to
        # be small
        self.L1 = (
            abs(self.hiddenLayer.W).sum()
            + abs(self.logRegressionLayer.W).sum()
        )

        # square of L2 norm ; one regularization option is to enforce
        # square of L2 norm to be small
        self.L2_sqr = (
            (self.hiddenLayer.W ** 2).sum()
            + (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood of the MLP is given by the negative
        # log likelihood of the output of the model, computed in the
        # logistic regression layer
        self.negative_log_likelihood = (
            self.logRegressionLayer.negative_log_likelihood
        )
        # same holds for the function computing the number of errors
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layers it is
        # made out of
        self.params = self.hiddenLayer.params + self.logRegressionLayer.params
As before, we train this model using stochastic gradient descent with mini-batches. The difference is that we
modify the cost function to include the regularization term. L1_reg and L2_reg are the hyperparameters
controlling the weight of these regularization terms in the total cost function. The code that computes the
new cost is:
    # the cost we minimize during training is the negative log likelihood of
    # the model plus the regularization terms (L1 and L2); cost is expressed
    # here symbolically
    cost = (
        classifier.negative_log_likelihood(y)
        + L1_reg * classifier.L1
        + L2_reg * classifier.L2_sqr
    )
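As a quick numerical check of how the two penalty terms enter the cost, here is a NumPy sketch with made-up values. The function regularized_cost, the weight matrices and the value used for the negative log likelihood are all hypothetical.

```python
import numpy as np

def regularized_cost(nll, weights, L1_reg, L2_reg):
    """Add L1 and squared-L2 penalties over all weight matrices to `nll`."""
    L1 = sum(np.abs(W).sum() for W in weights)
    L2_sqr = sum((W ** 2).sum() for W in weights)
    return nll + L1_reg * L1 + L2_reg * L2_sqr

W1 = np.array([[1.0, -2.0], [0.0, 3.0]])   # stands in for hiddenLayer.W
W2 = np.array([[-1.0], [2.0]])             # stands in for logRegressionLayer.W
cost = regularized_cost(0.5, [W1, W2], L1_reg=0.01, L2_reg=0.001)
```

Here the L1 norm is 9 and the squared L2 norm is 19, so the cost is 0.5 + 0.01 * 9 + 0.001 * 19 = 0.609. Note that only the weights are penalized, not the biases.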
We then update the parameters of the model using the gradient. This code is almost identical to the one
for logistic regression; only the number of parameters differs. To get around this (and write code that
could work for any number of parameters) we use the list of parameters that we created with the model,
params, and iterate over it, computing a gradient at each step.
    # compute the gradient of cost with respect to theta (stored in params)
    # the resulting gradients will be stored in a list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs

    # given two lists of the same length, A = [a1, a2, a3, a4] and
    # B = [b1, b2, b3, b4], zip generates a list C of same size, where each
    # element is a pair formed from the two lists :
    #    C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(classifier.params, gparams)
    ]

    # compiling a Theano function `train_model` that returns the cost, but
    # at the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
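The zip-based construction of the update list does not depend on Theano. The plain-Python sketch below shows the same pattern on scalar stand-ins; the parameter names, values and gradients are invented for the example.

```python
def sgd_updates(params, gparams, learning_rate):
    """Pair each parameter with its new value after one gradient step."""
    return [
        (name, value - learning_rate * grad)
        for (name, value), grad in zip(params, gparams)
    ]

# hypothetical parameters (name, value) and their gradients, in the same order
params = [('W', 1.0), ('b', -0.5)]
gparams = [0.2, -0.4]
updates = sgd_updates(params, gparams, learning_rate=0.5)
```

Because the pairing relies only on list order, the same code works whether the model has two parameters or twenty, as long as gparams is built in the same order as params.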
.. math::

    f(x) = G( b^{(2)} + W^{(2)}( s( b^{(1)} + W^{(1)} x))),

References:
    - textbooks: "Pattern Recognition and Machine Learning" Christopher M. Bishop, section 5
"""
__docformat__ = 'restructedtext en'
import os
import sys
import timeit
import numpy
import theano
import theano.tensor as T
# start-snippet-1
class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
        """
        Typical hidden layer of a MLP: units are fully-connected and have
        sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
        and the bias vector b is of shape (n_out,).

        NOTE : The nonlinearity used here is tanh

        Hidden unit activation is given by: tanh(dot(input,W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)

        :type n_in: int
        :param n_in: dimensionality of input

        :type n_out: int
        :param n_out: number of hidden units

        :type activation: theano.Op or function
        :param activation: Non linearity to be applied in the hidden
                           layer
        """
        self.input = input
        # end-snippet-1
        # W is initialized with W_values which is uniformly sampled
        # from the interval [-sqrt(6. / (n_in + n_out)), sqrt(6. / (n_in + n_out))]
        # for tanh activation function
        # the output of uniform is converted using asarray to dtype
        # theano.config.floatX so that the code is runnable on GPU
        # Note : optimal initialization of weights is dependent on the
        #        activation function used (among other things).
        #        For example, results presented in [Xavier10] suggest that you
        #        should use 4 times larger initial weights for sigmoid
        #        compared to tanh
        #        We have no info for other function, so we use the same as
        #        tanh.
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b

        lin_output = T.dot(input, self.W) + self.b
        self.output = (
            lin_output if activation is None
            else activation(lin_output)
        )
        # parameters of the model
        self.params = [self.W, self.b]
# start-snippet-2
class MLP(object):
    """Multi-Layer Perceptron Class

    A multilayer perceptron is a feedforward artificial neural network model
    that has one layer or more of hidden units and nonlinear activations.
            + abs(self.logRegressionLayer.W).sum()
        )

        # square of L2 norm ; one regularization option is to enforce
        # square of L2 norm to be small
        self.L2_sqr = (
            (self.hiddenLayer.W ** 2).sum()
            + (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood of the MLP is given by the negative
        # log likelihood of the output of the model, computed in the
        # logistic regression layer
        self.negative_log_likelihood = (
            self.logRegressionLayer.negative_log_likelihood
        )
        # same holds for the function computing the number of errors
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layers it is
        # made out of
        self.params = self.hiddenLayer.params + self.logRegressionLayer.params
        # end-snippet-3

        # keep track of model input
        self.input = input
                 http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz
    """
    datasets = load_data(dataset)

    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]

    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print '... building the model'

    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch
    x = T.matrix('x')  # the data is presented as rasterized images
    y = T.ivector('y')  # the labels are presented as 1D vector of
                        # [int] labels

    rng = numpy.random.RandomState(1234)

    # construct the MLP class
    classifier = MLP(
        rng=rng,
        input=x,
        n_in=28 * 28,
        n_hidden=n_hidden,
        n_out=10
    )

    # start-snippet-4
    # the cost we minimize during training is the negative log likelihood of
    # the model plus the regularization terms (L1 and L2); cost is expressed
    # here symbolically
    cost = (
        classifier.negative_log_likelihood(y)
        + L1_reg * classifier.L1
        + L2_reg * classifier.L2_sqr
    )
    # end-snippet-4

    # compiling a Theano function that computes the mistakes that are made
    # by the model on a minibatch
    test_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size:(index + 1) * batch_size],
            y: test_set_y[index * batch_size:(index + 1) * batch_size]
        }
    )

    validate_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size:(index + 1) * batch_size],
            y: valid_set_y[index * batch_size:(index + 1) * batch_size]
        }
    )
    # start-snippet-5
    # compute the gradient of cost with respect to theta (stored in params)
    # the resulting gradients will be stored in a list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs

    # given two lists of the same length, A = [a1, a2, a3, a4] and
    # B = [b1, b2, b3, b4], zip generates a list C of same size, where each
    # element is a pair formed from the two lists :
    #    C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(classifier.params, gparams)
    ]

    # compiling a Theano function `train_model` that returns the cost, but
    # at the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    # end-snippet-5
    ###############
    # TRAIN MODEL #
    ###############
    print '... training'

    # early-stopping parameters
    patience = 10000  # look at this many examples regardless
    patience_increase = 2
if __name__ == '__main__':
    test_mlp()
On an Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz the code runs at approximately 10.3 epochs/minute
and it took 828 epochs to reach a test error of 1.65%.
To put this into perspective, we refer the reader to the results section of this page.
BackProp by Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Mueller. Here, we summarize
the same issues, with an emphasis on the parameters and techniques that we actually used in our code.
5.4.1 Nonlinearity
Two of the most common ones are the sigmoid and the tanh function. For reasons explained in Section 4.4,
nonlinearities that are symmetric around the origin are preferred because they tend to produce zero-mean
inputs to the next layer (which is a desirable property). Empirically, we have observed that the tanh has
better convergence properties.
constraints leads to the following initialization:
$uniform[-\sqrt{6/(fan_{in}+fan_{out})}, \sqrt{6/(fan_{in}+fan_{out})}]$ for tanh and
$uniform[-4\sqrt{6/(fan_{in}+fan_{out})}, 4\sqrt{6/(fan_{in}+fan_{out})}]$ for sigmoid, where $fan_{in}$ is the
number of inputs and $fan_{out}$ the number of hidden units. For mathematical considerations please refer to
[Xavier10].
Section 4.7 details procedures for choosing a learning rate for each parameter (weight) in our network and
for choosing them adaptively based on the error of the classifier.
CHAPTER
SIX
Note: This section assumes the reader has already read through Classifying MNIST digits using Logistic
Regression and Multilayer Perceptron. Additionally, it uses the following new Theano functions and
concepts: T.tanh, shared variables, basic arithmetic ops, T.grad, floatX, downsample, conv2d, dimshuffle. If
you intend to run the code on GPU also read GPU.
To run this example on a GPU, you need a good GPU. It needs at least 1GB of GPU RAM. More may be
required if your monitor is connected to the GPU.
When the GPU is connected to the monitor, there is a limit of a few seconds for each GPU function call.
This is needed as current GPUs can't be used for the monitor while doing computation. Without this limit,
the screen would freeze for too long and make it look as if the computer froze. This example hits this limit
with medium-quality GPUs. When the GPU isn't connected to a monitor, there is no time limit. You can
lower the batch size to fix the time out problem.
Note: The code for this section is available for download here and the 3wolfmoon image
6.1 Motivation
Convolutional Neural Networks (CNN) are biologically-inspired variants of MLPs. From Hubel and
Wiesel's early work on the cat's visual cortex [Hubel68], we know the visual cortex contains a complex
arrangement of cells. These cells are sensitive to small sub-regions of the visual field, called receptive
fields. The sub-regions are tiled to cover the entire visual field. These cells act as local filters over the input
space and are well-suited to exploit the strong spatially local correlation present in natural images.
Additionally, two basic cell types have been identified: Simple cells respond maximally to specific edge-like
patterns within their receptive field. Complex cells have larger receptive fields and are locally invariant to
the exact position of the pattern.
The animal visual cortex being the most powerful visual processing system in existence, it seems natural to
emulate its behavior. Hence, many neurally-inspired models can be found in the literature. To name a few:
the NeoCognitron [Fukushima], HMAX [Serre07] and LeNet-5 [LeCun98], which will be the focus of this
tutorial.
Imagine that layer m-1 is the input retina. In the above figure, units in layer m have receptive fields of width
3 in the input retina and are thus only connected to 3 adjacent neurons in the retina layer. Units in layer m+1
have a similar connectivity with the layer below. We say that their receptive field with respect to the layer
below is also 3, but their receptive field with respect to the input is larger (5). Each unit is unresponsive to
variations outside of its receptive field with respect to the retina. The architecture thus ensures that the learnt
filters produce the strongest response to a spatially local input pattern.
However, as shown above, stacking many such layers leads to (non-linear) filters that become increasingly
global (i.e. responsive to a larger region of pixel space). For example, the unit in hidden layer m+1 can
encode a non-linear feature of width 5 (in terms of pixel space).
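The growth of the receptive field with depth follows a simple rule when every layer uses the same filter width and neighbouring units are shifted by one position. The short sketch below assumes exactly that (stride 1, no pooling); the function name is hypothetical.

```python
def receptive_field(filter_width, n_layers):
    """Width, in input units, seen by one unit after `n_layers` local layers."""
    width = 1
    for _ in range(n_layers):
        # each extra layer widens the field by (filter_width - 1) input units
        width += filter_width - 1
    return width

rf_layer_m = receptive_field(3, 1)         # units in layer m
rf_layer_m_plus_1 = receptive_field(3, 2)  # units in layer m+1
```

With filters of width 3 this gives 3 for layer m and 5 for layer m+1, the values quoted above.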
In the above figure, we show 3 hidden units belonging to the same feature map. Weights of the same color
are shared, i.e. constrained to be identical. Gradient descent can still be used to learn such shared parameters,
with only a small change to the original algorithm. The gradient of a shared weight is simply the sum of the
gradients of the parameters being shared.
Replicating units in this way allows for features to be detected regardless of their position in the visual field.
Additionally, weight sharing increases learning efficiency by greatly reducing the number of free parameters
being learnt. The constraints on the model enable CNNs to achieve better generalization on vision problems.
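The statement that the gradient of a shared weight is the sum of the gradients of the tied copies can be checked numerically. The NumPy sketch below uses a made-up 1D example (3 hidden units sharing one filter of width 3 over 5 inputs) and a toy linear loss; all values are assumptions chosen for illustration.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0, 0.0, 1.5])   # input "retina" of 5 units
w = np.array([0.2, -0.3, 0.1])             # one filter shared by 3 hidden units
c = np.array([1.0, 2.0, -1.0])             # hypothetical coefficients of a toy loss

def loss(w):
    # hidden unit j applies the shared filter to inputs j, j+1, j+2
    h = np.array([w.dot(x[j:j + 3]) for j in range(3)])
    return c.dot(h)

# gradient each hidden unit would assign to its own (untied) copy of the filter
per_unit_grads = np.array([c[j] * x[j:j + 3] for j in range(3)])
# gradient of the shared filter: the sum over the tied copies
shared_grad = per_unit_grads.sum(axis=0)

# finite-difference check of the gradient with respect to the shared filter
eps = 1e-6
numeric_grad = np.array([
    (loss(w + eps * np.eye(3)[k]) - loss(w - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])
```

The summed gradient and the finite-difference estimate agree, which is the only change weight sharing imposes on plain gradient descent.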
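Note: for a 1D signal, the convolution is defined as
$o[n] = f[n] * g[n] = \sum_{u=-\infty}^{\infty} f[u] g[n-u] = \sum_{u=-\infty}^{\infty} f[n-u] g[u]$.
This can be extended to 2D as follows:
$o[m, n] = f[m, n] * g[m, n] = \sum_{u=-\infty}^{\infty} \sum_{v=-\infty}^{\infty} f[u, v] g[m-u, n-v]$.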
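For finite signals the infinite sums reduce to the terms where both indices are valid. The sketch below implements the 1D definition directly in Python and compares it with numpy.convolve; the function name conv1d and the two short signals are made up for the example.

```python
import numpy as np

def conv1d(f, g):
    """Full discrete convolution o[n] = sum_u f[u] * g[n - u] of two finite signals."""
    out = [0.0] * (len(f) + len(g) - 1)
    for n in range(len(out)):
        for u in range(len(f)):
            if 0 <= n - u < len(g):        # skip terms outside the support of g
                out[n] += f[u] * g[n - u]
    return out

f = [1.0, 2.0, 3.0]
g = [0.0, 1.0, 0.5]
o = conv1d(f, g)
reference = np.convolve(f, g)              # NumPy's full convolution
```

Swapping the arguments gives the same result, which reflects the equality of the two sums in the 1D definition.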
To form a richer representation of the data, each hidden layer is composed of multiple feature maps,
$\{h^{(k)}, k = 0..K\}$. The weights $W$ of a hidden layer can be represented in a 4D tensor containing
elements for every combination of destination feature map, source feature map, source vertical position, and
source horizontal position. The biases $b$ can be represented as a vector containing one element for every
destination feature map. We illustrate this graphically as follows:
$W^1$ of $h^0$ and $h^1$ are thus 3D weight tensors. The leading dimension indexes the input feature maps, while
the other two refer to the pixel coordinates.
Putting it all together, $W^{kl}_{ij}$ denotes the weight connecting each pixel of the k-th feature map at layer m, with
the pixel at coordinates (i,j) of the l-th feature map of layer (m-1).
# build symbolic expression that computes the convolution of input with filters in w
conv_out = conv.conv2d(input, W)

# build symbolic expression to add bias and apply activation function, i.e. produce neural
# A few words on ``dimshuffle`` :
#   ``dimshuffle`` is a powerful tool in reshaping a tensor;
#   what it allows you to do is to shuffle dimensions around
#   but also to insert new ones along which the tensor will be
#   broadcastable;
#   dimshuffle('x', 2, 'x', 0, 1)
#   This will work on 3d tensors with no broadcastable
#   dimensions. The first dimension will be broadcastable,
#   then we will have the third dimension of the input tensor as
#   the second of the resulting tensor, etc. If the tensor has
#   shape (20, 30, 40), the resulting tensor will have dimensions
#   (1, 40, 1, 20, 30). (AxBxC tensor is mapped to 1xCx1xAxB tensor)
#   More examples:
#    dimshuffle('x') -> make a 0d (scalar) into a 1d vector
#    dimshuffle(0, 1) -> identity
#    dimshuffle(1, 0) -> inverts the first and second dimensions
#    dimshuffle('x', 0) -> make a row out of a 1d vector (N to 1xN)
#    dimshuffle(0, 'x') -> make a column out of a 1d vector (N to Nx1)
#    dimshuffle(2, 0, 1) -> AxBxC to CxAxB
#    dimshuffle(0, 'x', 1) -> AxB to Ax1xB
#    dimshuffle(1, 'x', 0) -> AxB to Bx1xA
output = T.nnet.sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x'))

# create theano function to compute filtered images
f = theano.function([input], output)
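For readers more familiar with NumPy, the effect of the dimshuffle calls described in the comments can be approximated with newaxis indexing and transpose. This is an analogy under the assumption that NumPy broadcasting behaves like Theano's broadcastable dimensions here; the array shapes are made up.

```python
import numpy as np

conv_out = np.zeros((2, 3, 4, 4))   # (batch, feature maps, height, width)
b = np.array([0.1, 0.2, 0.3])       # one bias per feature map

# rough analogue of b.dimshuffle('x', 0, 'x', 'x'): shape (3,) -> (1, 3, 1, 1)
b_4d = b[np.newaxis, :, np.newaxis, np.newaxis]
out = conv_out + b_4d               # the bias is broadcast over batch and pixels

# rough analogue of dimshuffle('x', 2, 'x', 0, 1) on an AxBxC tensor
t = np.zeros((20, 30, 40))
t_shuffled = t.transpose(2, 0, 1)[np.newaxis, :, np.newaxis, :, :]
```

After the addition every pixel of feature map k has been shifted by b[k], which is what the bias term of a convolutional layer requires.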
Notice that a randomly initialized filter acts very much like an edge detector!
Note that we use the same weight initialization formula as with the MLP. Weights are sampled randomly
from a uniform distribution in the range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a hidden
unit. For MLPs, this was the number of units in the layer below. For CNNs however, we have to take into
account the number of input feature maps and the size of the receptive fields.
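A sketch of how fan-in can be computed for a convolutional layer is given below. It follows the range as stated in the text, [-1/fan-in, 1/fan-in]; the helper name, the filter shape and the seed are assumptions made for illustration.

```python
import numpy as np

def init_conv_filters(rng, filter_shape):
    """filter_shape = (output maps, input maps, filter height, filter width)."""
    # each hidden unit sees (input maps * filter height * filter width) inputs
    fan_in = int(np.prod(filter_shape[1:]))
    bound = 1.0 / fan_in
    W = rng.uniform(low=-bound, high=bound, size=filter_shape)
    b = np.zeros(filter_shape[0])   # one bias per output feature map
    return W, b, fan_in

W, b, fan_in = init_conv_filters(np.random.RandomState(23455), (2, 3, 9, 9))
```

For 3 input feature maps and 9x9 receptive fields, fan-in is 3 * 9 * 9 = 243.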
6.6 MaxPooling
Another important concept of CNNs is max-pooling, which is a form of non-linear down-sampling.
Max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region,
outputs the maximum value.
Max-pooling is useful in vision for two reasons:
1. By eliminating non-maximal values, it reduces computation for upper layers.
2. It provides a form of translation invariance. Imagine cascading a max-pooling layer with a
convolutional layer. There are 8 directions in which one can translate the input image by a
single pixel. If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations
will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3
window, this jumps to 5/8.
Since it provides additional robustness to position, max-pooling is a smart way of reducing
the dimensionality of intermediate representations.
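The partition-and-maximum operation can be written in a few lines of NumPy. The sketch below is an independent illustration, not Theano's implementation: it pools over the two trailing dimensions and, as an assumption, drops border rows and columns that do not fill a complete window.

```python
import numpy as np

def max_pool_2d(x, ds):
    """Non-overlapping max-pooling over the last two axes of `x`."""
    ph, pw = ds
    h, w = x.shape[-2] // ph, x.shape[-1] // pw
    x = x[..., :h * ph, :w * pw]                  # drop the incomplete border
    x = x.reshape(x.shape[:-2] + (h, ph, w, pw))  # split each axis into windows
    return x.max(axis=(-3, -1))                   # maximum inside each window

image = np.arange(25).reshape(1, 1, 5, 5)         # hypothetical 5x5 input
pooled = max_pool_2d(image, (2, 2))
```

On the 5x5 input the fifth row and column are ignored and each remaining 2x2 block is replaced by its largest entry, giving a 2x2 output.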
Max-pooling is done in Theano by way of theano.tensor.signal.downsample.max_pool_2d.
This function takes as input an N dimensional tensor (where N >= 2) and a downscaling factor and performs
max-pooling over the 2 trailing dimensions of the tensor.
An example is worth a thousand words:
from theano.tensor.signal import downsample

input = T.dtensor4('input')
maxpool_shape = (2, 2)
pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=True)
f = theano.function([input], pool_out)

invals = numpy.random.RandomState(1).rand(3, 2, 5, 5)
1.14374817e-04
3.45560727e-01
2.04452250e-01
5.58689828e-01
3.13424178e-01
3.02332573e-01
3.96767474e-01
8.78117436e-01
1.40386939e-01
6.92322616e-01
1.46755891e-01]
5.38816734e-01]
2.73875932e-02]
1.98101489e-01]
8.76389152e-01]]
0.49157316]
0.69975836]
0.04995346]
0.58655504]
0.39767684]]
Note that compared to most Theano code, the max_pool_2d operation is a little special. It requires the
downscaling factor ds (tuple of length 2 containing downscaling factors for image width and height) to be
known at graph build time. This may change in the near future.
The lower layers are composed of alternating convolution and max-pooling layers. The upper layers, however, are fully connected and correspond to a traditional MLP (hidden layer + logistic regression). The input
to the first fully-connected layer is the set of all feature maps at the layer below.
From an implementation point of view, this means lower-layers operate on 4D tensors. These are then
flattened to a 2D matrix of rasterized feature maps, to be compatible with our previous MLP implementation.
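The shape bookkeeping behind this flattening can be sketched in NumPy. The sizes below (28x28 images, 5x5 filters, 2x2 pooling, a batch of 500 and 50 feature maps in the last convolutional layer) are the ones quoted in this section's code comments; the helper conv_pool_output_size is a hypothetical name used only for illustration.

```python
import numpy as np


def conv_pool_output_size(size, filter_size, pool_size):
    """Side length after a 'valid' convolution followed by max-pooling."""
    conv_size = size - filter_size + 1
    return conv_size // pool_size


side0 = conv_pool_output_size(28, 5, 2)     # (28 - 5 + 1) / 2 = 12
side1 = conv_pool_output_size(side0, 5, 2)  # (12 - 5 + 1) / 2 = 4

batch_size, n_maps = 500, 50
# stand-in for the 4D output of the last convolution/pooling layer
feature_maps = np.zeros((batch_size, n_maps, side1, side1))
# rasterize every example: one row per example, as the MLP expects
flat = feature_maps.reshape(batch_size, -1)  # shape (500, 800)
```

The reshape plays the role that flatten(2) plays on the symbolic tensor: the batch axis is kept and all remaining axes are merged.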
Note: The term convolution can correspond to different mathematical operations:
1. theano.tensor.nnet.conv2d, which is the most common one in almost all recently published convolutional models. In this operation, each output feature map is connected to each input feature map
by a different 2D filter, and its value is the sum of the individual convolutions of all inputs through the
corresponding filters.
2. The convolution used in the original LeNet model: In this work, each output feature map is only
connected to a subset of input feature maps.
3. The convolution used in signal processing: theano.tensor.signal.conv.conv2d, which works only on
single channel inputs.
Here, we use the first operation, so this model differs slightly from the original LeNet paper. One reason to
use operation 2 would be to reduce the amount of computation needed, but modern hardware makes the
full connection pattern just as fast. Another reason would be to slightly reduce the number of free parameters, but
we have other regularization techniques at our disposal.
# add the bias term. Since the bias is a vector (1D array), we first
# reshape it to a tensor of shape (1, n_filters, 1, 1). Each bias will
# thus be broadcasted across mini-batches and feature map
# width & height
self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
# store parameters of this layer
self.params = [self.W, self.b]
# keep track of model input
self.input = input
Notice that when initializing the weight values, the fan-in is determined by the size of the receptive fields
and the number of input feature maps.
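As a rough illustration of that remark, the fan-in of a convolutional unit can be computed directly from the filter shape. This is only a sketch: conv_fan_in is a hypothetical helper, and the value 20 for the number of first-layer kernels is an assumption (only the 50 kernels of the second layer are quoted in this section).

```python
import numpy as np


def conv_fan_in(filter_shape):
    """Number of inputs feeding one unit of a convolutional layer.

    filter_shape = (n_output_maps, n_input_maps, filter_height, filter_width)
    """
    return int(np.prod(filter_shape[1:]))


# first layer: 1 input map, 5x5 receptive field (20 output maps assumed)
fan_in_layer0 = conv_fan_in((20, 1, 5, 5))   # 1 * 5 * 5 = 25
# second layer: 20 input maps (assumed), 5x5 receptive field, 50 output maps
fan_in_layer1 = conv_fan_in((50, 20, 5, 5))  # 20 * 5 * 5 = 500
```

The number of output maps does not enter the fan-in; only the number of input maps and the receptive field size do.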
Finally, using the LogisticRegression class defined in Classifying MNIST digits using Logistic Regression
and the HiddenLayer class defined in Multilayer Perceptron, we can instantiate the network as follows.
x = T.matrix('x')   # minibatch of rasterized images
y = T.ivector('y')  # corresponding integer labels
######################
# BUILD ACTUAL MODEL #
######################
print '... building the model'
# Reshape matrix of rasterized images of shape (batch_size, 28 * 28)
# to a 4D tensor, compatible with our LeNetConvPoolLayer
# (28, 28) is the size of MNIST images.
layer0_input = x.reshape((batch_size, 1, 28, 28))
# Construct the first convolutional pooling layer:
# filtering reduces the image size to (28-5+1 , 28-5+1) = (24, 24)
# maxpooling reduces this further to (24/2, 24/2) = (12, 12)
# 4D output tensor is thus of shape (batch_size, nkerns[0], 12, 12)
layer0 = LeNetConvPoolLayer(
rng,
input=layer0_input,
image_shape=(batch_size, 1, 28, 28),
filter_shape=(nkerns[0], 1, 5, 5),
poolsize=(2, 2)
)
# Construct the second convolutional pooling layer
# filtering reduces the image size to (12-5+1, 12-5+1) = (8, 8)
# maxpooling reduces this further to (8/2, 8/2) = (4, 4)
# 4D output tensor is thus of shape (batch_size, nkerns[1], 4, 4)
layer1 = LeNetConvPoolLayer(
rng,
input=layer0.output,
image_shape=(batch_size, nkerns[0], 12, 12),
filter_shape=(nkerns[1], nkerns[0], 5, 5),
poolsize=(2, 2)
)
# the HiddenLayer being fully-connected, it operates on 2D matrices of
# shape (batch_size, num_pixels) (i.e matrix of rasterized images).
# This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
# or (500, 50 * 4 * 4) = (500, 800) with the default values.
layer2_input = layer1.output.flatten(2)
# construct a fully-connected sigmoidal layer
layer2 = HiddenLayer(
rng,
input=layer2_input,
n_in=nkerns[1] * 4 * 4,
n_out=500,
activation=T.tanh
)
# classify the values of the fully-connected sigmoidal layer
layer3 = LogisticRegression(input=layer2.output, n_in=500, n_out=10)
# the cost we minimize during training is the NLL of the model
cost = layer3.negative_log_likelihood(y)
# create a function to compute the mistakes that are made by the model
test_model = theano.function(
[index],
layer3.errors(y),
givens={
x: test_set_x[index * batch_size: (index + 1) * batch_size],
y: test_set_y[index * batch_size: (index + 1) * batch_size]
}
)
validate_model = theano.function(
[index],
layer3.errors(y),
givens={
x: valid_set_x[index * batch_size: (index + 1) * batch_size],
y: valid_set_y[index * batch_size: (index + 1) * batch_size]
}
)
# create a list of all model parameters to be fit by gradient descent
params = layer3.params + layer2.params + layer1.params + layer0.params
# create a list of gradients for all model parameters
grads = T.grad(cost, params)
We leave out the code that performs the actual training and early-stopping, since it is exactly the same as
with an MLP. The interested reader can nevertheless access the code in the code folder of DeepLearningTutorials.
The following output was obtained with the default parameters on a Core i7-2600K CPU clocked at 3.40GHz
and using flags floatX=float32:
Optimization complete.
Best validation score of 0.910000 % obtained at iteration 17800,with test
performance 0.920000 %
The code for file convolutional_mlp.py ran for 380.28m
Note that the discrepancies in validation and test error (as well as iteration count) are due to different implementations of the rounding mechanism in hardware. They can be safely ignored.
Tips
If you want to try this model on a new dataset, here are a few tips that can help you get better results:
Whitening the data (e.g. with PCA)
Decay the learning rate in each epoch
CHAPTER
SEVEN
Note: This section assumes the reader has already read through Classifying MNIST digits using Logistic
Regression and Multilayer Perceptron. Additionally, it uses the following Theano functions and concepts:
T.tanh, shared variables, basic arithmetic ops, T.grad, Random numbers, floatX. If you intend to run the
code on GPU, also read GPU.
Note: The code for this section is available for download here.
The Denoising Autoencoder (dA) is an extension of a classical autoencoder and it was introduced as a
building block for deep networks in [Vincent08]. We will start the tutorial with a short discussion on
Autoencoders.
7.1 Autoencoders
See section 4.6 of [Bengio09] for an overview of auto-encoders. An autoencoder takes an input $x \in [0, 1]^d$
and first maps it (with an encoder) to a hidden representation $y \in [0, 1]^{d'}$ through a deterministic mapping,
e.g.:
$$y = s(Wx + b)$$
Where s is a non-linearity such as the sigmoid. The latent representation y, or code is then mapped back
(with a decoder) into a reconstruction z of the same shape as x. The mapping happens through a similar
transformation, e.g.:
$$z = s(W' y + b')$$
(Here, the prime symbol does not indicate matrix transposition.) $z$ should be seen as a prediction of $x$,
given the code $y$. Optionally, the weight matrix $W'$ of the reverse mapping may be constrained to be the
transpose of the forward mapping: $W' = W^T$. This is referred to as tied weights. The parameters of this
model (namely $W$, $b$, $b'$ and, if one doesn't use tied weights, also $W'$) are optimized such that the average
reconstruction error is minimized.
The reconstruction error can be measured in many ways, depending on the appropriate distributional assumptions on the input given the code. The traditional squared error $L(x, z) = \|x - z\|^2$ can be used. If
the input is interpreted as either bit vectors or vectors of bit probabilities, cross-entropy of the reconstruction
can be used:
$$L_H(x, z) = -\sum_{k=1}^{d} [x_k \log z_k + (1 - x_k) \log(1 - z_k)]$$
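The encoder, the decoder and the two reconstruction errors described above can be sketched in a few lines of NumPy. This is an illustration only, not the tutorial's Theano code: the dimensions, the random weights and the use of tied weights are arbitrary choices made for the example.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


rng = np.random.RandomState(0)
d, d_hidden = 6, 3                 # input and code dimensions (arbitrary)
x = rng.rand(d)                    # an input in [0, 1]^d
W = rng.uniform(-0.5, 0.5, size=(d_hidden, d))
b = np.zeros(d_hidden)
b_prime = np.zeros(d)

y = sigmoid(W.dot(x) + b)              # code: y = s(Wx + b)
z = sigmoid(W.T.dot(y) + b_prime)      # reconstruction with tied weights W' = W^T

squared_error = np.sum((x - z) ** 2)
cross_entropy = -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))
```

Because z passes through the sigmoid, every entry lies strictly between 0 and 1, so both logarithms in the cross-entropy are finite.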
The hope is that the code y is a distributed representation that captures the coordinates along the main
factors of variation in the data. This is similar to the way the projection on principal components would
capture the main factors of variation in the data. Indeed, if there is one linear hidden layer (the code) and the
mean squared error criterion is used to train the network, then the k hidden units learn to project the input in
the span of the first k principal components of the data. If the hidden layer is non-linear, the auto-encoder
behaves differently from PCA, with the ability to capture multi-modal aspects of the input distribution. The
departure from PCA becomes even more important when we consider stacking multiple encoders (and their
corresponding decoders) when building a deep auto-encoder [Hinton06].
Because y is viewed as a lossy compression of x, it cannot be a good (small-loss) compression for all x.
Optimization makes it a good compression for training examples, and hopefully for other inputs as well, but
not for arbitrary inputs. That is the sense in which an auto-encoder generalizes: it gives low reconstruction
error on test examples from the same distribution as the training examples, but generally high reconstruction
error on samples randomly chosen from the input space.
We want to implement an auto-encoder using Theano, in the form of a class, that could afterwards be used
in constructing a stacked autoencoder. The first step is to create shared variables for the parameters of the
autoencoder: $W$, $b$ and $b'$. (Since we are using tied weights in this tutorial, $W^T$ will be used for $W'$):
def __init__(
self,
numpy_rng,
theano_rng=None,
input=None,
n_visible=784,
n_hidden=500,
W=None,
bhid=None,
bvis=None
):
"""
Initialize the dA class by specifying the number of visible units (the
dimension d of the input), the number of hidden units (the dimension
d' of the latent or hidden space) and the corruption level. The
constructor also receives symbolic variables for the input, weights and
bias. Such symbolic variables are useful when, for example, the input
is the result of some computations, or when weights are shared between
the dA and an MLP layer. When dealing with SdAs this always happens:
the dA on layer 2 gets as input the output of the dA on layer 1,
and the weights of the dA are used in the second stage of training
to construct an MLP.
:type numpy_rng: numpy.random.RandomState
:param numpy_rng: numpy random number generator used to generate weights
:type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
"""
self.n_visible = n_visible
self.n_hidden = n_hidden
# create a Theano random generator that gives symbolic random values
if not theano_rng:
    theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))
# note : W' was written as `W_prime` and b' as `b_prime`
if not W:
    # W is initialized with `initial_W`, which is uniformly sampled
    # from -4*sqrt(6./(n_visible+n_hidden)) to
    # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
    # converted using asarray to dtype
    # theano.config.floatX so that the code is runnable on GPU
    initial_W = numpy.asarray(
        numpy_rng.uniform(
            low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
            high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
            size=(n_visible, n_hidden)
        ),
        dtype=theano.config.floatX
    )
    W = theano.shared(value=initial_W, name='W', borrow=True)
if not bvis:
    bvis = theano.shared(
        value=numpy.zeros(
            n_visible,
            dtype=theano.config.floatX
        ),
        borrow=True
    )
if not bhid:
    bhid = theano.shared(
        value=numpy.zeros(
            n_hidden,
            dtype=theano.config.floatX
        ),
        name='b',
        borrow=True
    )
self.W = W
# b corresponds to the bias of the hidden
self.b = bhid
# b_prime corresponds to the bias of the visible
self.b_prime = bvis
# tied weights, therefore W_prime is W transpose
self.W_prime = self.W.T
self.theano_rng = theano_rng
# if no input is given, generate a variable representing the input
if input is None:
    # we use a matrix because we expect a minibatch of several
    # examples, each example being a row
    self.x = T.dmatrix(name='input')
else:
    self.x = input
self.params = [self.W, self.b, self.b_prime]
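The initialization range used for W in the constructor can be checked outside Theano. The sketch below reproduces only the sampling step in NumPy; casting to float32 stands in for theano.config.floatX and is an assumption about the configuration.

```python
import numpy as np

n_visible, n_hidden = 784, 500
rng = np.random.RandomState(123)

# bound of the uniform interval: 4 * sqrt(6 / (n_hidden + n_visible))
bound = 4 * np.sqrt(6.0 / (n_hidden + n_visible))

initial_W = np.asarray(
    rng.uniform(low=-bound, high=bound, size=(n_visible, n_hidden)),
    dtype='float32'  # assumed value of theano.config.floatX
)
# biases start at zero, one per hidden unit and one per visible unit
initial_bhid = np.zeros(n_hidden, dtype='float32')
initial_bvis = np.zeros(n_visible, dtype='float32')
```

With the default sizes the bound is roughly 0.27, so the initial weights are small but not negligible.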
Note that we pass the symbolic input to the autoencoder as a parameter. This is so that we can concatenate
layers of autoencoders to form a deep network: the symbolic output (the y above) of layer k will be the
symbolic input of layer k + 1.
Now we can express the computation of the latent representation and of the reconstructed signal:
def get_hidden_values(self, input):
""" Computes the values of the hidden layer """
return T.nnet.sigmoid(T.dot(input, self.W) + self.b)
def get_reconstructed_input(self, hidden):
"""Computes the reconstructed input given the values of the
hidden layer
"""
return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)
And using these functions we can compute the cost and the updates of one stochastic gradient descent step:
def get_cost_updates(self, corruption_level, learning_rate):
    """ This function computes the cost and the updates for one training
    step of the dA """
    tilde_x = self.get_corrupted_input(self.x, corruption_level)
    y = self.get_hidden_values(tilde_x)
    z = self.get_reconstructed_input(y)
    # note : we sum over the size of a datapoint; if we are using
    #        minibatches, L will be a vector, with one entry per
    #        example in minibatch
    L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
    # note : L is now a vector, where each element is the
    #        cross-entropy cost of the reconstruction of the
    #        corresponding example of the minibatch. We need to
    #        compute the average of all these to get the cost of
    #        the minibatch
    cost = T.mean(L)
    # compute the gradients of the cost of the dA with respect
    # to its parameters
    gparams = T.grad(cost, self.params)
    # generate the list of updates
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(self.params, gparams)
    ]
    return (cost, updates)
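To see what T.grad and the updates list amount to, the following NumPy sketch writes out the gradients of the mean cross-entropy by hand for a tied-weight autoencoder without corruption, and applies one gradient descent step. It is an illustration under those assumptions, not the tutorial's implementation; the function names and the tiny random data are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def cost_and_grads(X, W, b, b_prime):
    """Mean cross-entropy of a tied-weight autoencoder and its gradients.

    X has one example per row; W has shape (n_visible, n_hidden).
    """
    n = X.shape[0]
    Y = sigmoid(X.dot(W) + b)              # hidden code
    Z = sigmoid(Y.dot(W.T) + b_prime)      # reconstruction
    L = -np.sum(X * np.log(Z) + (1 - X) * np.log(1 - Z), axis=1)
    cost = np.mean(L)
    # gradient w.r.t. the decoder pre-activation (sigmoid + cross-entropy)
    dZ_pre = (Z - X) / n
    # back-propagate through the decoder weights and the hidden sigmoid
    dY_pre = dZ_pre.dot(W) * Y * (1 - Y)
    # W is used twice (encoder and decoder), so both contributions add up
    gW = X.T.dot(dY_pre) + dZ_pre.T.dot(Y)
    gb = dY_pre.sum(axis=0)
    gb_prime = dZ_pre.sum(axis=0)
    return cost, [gW, gb, gb_prime]


def sgd_step(params, grads, learning_rate):
    """One update: param <- param - learning_rate * gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


rng = np.random.RandomState(0)
X = rng.rand(5, 4)
params = [rng.uniform(-0.5, 0.5, size=(4, 3)), np.zeros(3), np.zeros(4)]

cost_before, grads = cost_and_grads(X, *params)
params = sgd_step(params, grads, learning_rate=0.01)
cost_after, _ = cost_and_grads(X, *params)
```

The list comprehension in sgd_step mirrors the (param, param - learning_rate * gparam) pairs built in get_cost_updates; Theano derives the gradients symbolically instead of requiring them to be written by hand.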
We can now define a function that, applied iteratively, will update the parameters W, b and b_prime such
that the reconstruction cost is approximately minimized.
da = dA(
numpy_rng=rng,
theano_rng=theano_rng,
input=x,
n_visible=28 * 28,
n_hidden=500
)
cost, updates = da.get_cost_updates(
corruption_level=0.,
learning_rate=learning_rate
)
train_da = theano.function(
[index],
cost,
updates=updates,
givens={
x: train_set_x[index * batch_size: (index + 1) * batch_size]
}
)
start_time = timeit.default_timer()
############
# TRAINING #
############
# go through training epochs
for epoch in xrange(training_epochs):
    # go through training set
    c = []
    for batch_index in xrange(n_train_batches):
        c.append(train_da(batch_index))
    print 'Training epoch %d, cost ' % epoch, numpy.mean(c)
end_time = timeit.default_timer()
training_time = (end_time - start_time)
print >> sys.stderr, ('The no corruption code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % ((training_time) / 60.))
image = Image.fromarray(
    tile_raster_images(X=da.W.get_value(borrow=True).T,
                       img_shape=(28, 28), tile_shape=(10, 10),
                       tile_spacing=(1, 1)))
image.save('filters_corruption_0.png')
# start-snippet-3
#####################################
# BUILDING THE MODEL CORRUPTION 30% #
#####################################
rng = numpy.random.RandomState(123)
theano_rng = RandomStreams(rng.randint(2 ** 30))
da = dA(
numpy_rng=rng,
theano_rng=theano_rng,
input=x,
n_visible=28 * 28,
n_hidden=500
)
cost, updates = da.get_cost_updates(
corruption_level=0.3,
learning_rate=learning_rate
)
train_da = theano.function(
[index],
cost,
updates=updates,
givens={
x: train_set_x[index * batch_size: (index + 1) * batch_size]
}
)
start_time = timeit.default_timer()
############
# TRAINING #
############
# go through training epochs
for epoch in xrange(training_epochs):
    # go through training set
    c = []
    for batch_index in xrange(n_train_batches):
        c.append(train_da(batch_index))
    print 'Training epoch %d, cost ' % epoch, numpy.mean(c)
end_time = timeit.default_timer()
training_time = (end_time - start_time)
print >> sys.stderr, ('The 30% corruption code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % (training_time / 60.))
# end-snippet-3
# start-snippet-4
image = Image.fromarray(tile_raster_images(
    X=da.W.get_value(borrow=True).T,
    img_shape=(28, 28), tile_shape=(10, 10),
    tile_spacing=(1, 1)))
image.save('filters_corruption_30.png')
# end-snippet-4
os.chdir('../')
if __name__ == '__main__':
    test_dA()
If there is no constraint besides minimizing the reconstruction error, one might expect an auto-encoder with
n inputs and an encoding of dimension n (or greater) to learn the identity function, merely mapping an input
to its copy. Such an autoencoder would not differentiate test examples (from the training distribution) from
other input configurations.
Surprisingly, experiments reported in [Bengio07] suggest that, in practice, when trained with stochastic
gradient descent, non-linear auto-encoders with more hidden units than inputs (called overcomplete) yield
useful representations. (Here, useful means that a network taking the encoding as input has low classification error.)
A simple explanation is that stochastic gradient descent with early stopping is similar to an L2 regularization
of the parameters. To achieve perfect reconstruction of continuous inputs, a one-hidden layer auto-encoder
with non-linear hidden units (exactly like in the above code) needs very small weights in the first (encoding)
layer, to bring the non-linearity of the hidden units into their linear regime, and very large weights in the
second (decoding) layer. With binary inputs, very large weights are also needed to completely minimize
the reconstruction error. Since the implicit or explicit regularization makes it difficult to reach large-weight
solutions, the optimization algorithm finds encodings which only work well for examples similar to those in
the training set, which is what we want. It means that the representation is exploiting statistical regularities
present in the training set, rather than merely learning to replicate the input.
There are other ways by which an auto-encoder with more hidden units than inputs could be prevented from
learning the identity function, capturing something useful about the input in its hidden representation. One
is the addition of sparsity (forcing many of the hidden units to be zero or near-zero). Sparsity has been exploited very successfully by many [Ranzato07] [Lee08]. Another is to add randomness in the transformation
from input to reconstruction. This technique is used in Restricted Boltzmann Machines (discussed later in
Restricted Boltzmann Machines (RBM)), as well as in Denoising Auto-Encoders, discussed below.
In the stacked autoencoder class (Stacked Autoencoders) the weights of the dA class have to be shared with
those of a corresponding sigmoid layer. For this reason, the constructor of the dA also gets Theano variables
pointing to the shared parameters. If those parameters are left to None, new ones will be constructed.
The final denoising autoencoder class becomes:
class dA(object):
    r"""Denoising Auto-Encoder class (dA)

    A denoising autoencoder tries to reconstruct the input from a corrupted
    version of it by projecting it first in a latent space and reprojecting
    it afterwards back in the input space. Please refer to Vincent et al., 2008
    for more details. If x is the input then equation (1) computes a partially
    destroyed version of x by means of a stochastic mapping q_D. Equation (2)
    computes the projection of the input into the latent space. Equation (3)
    computes the reconstruction of the input, while equation (4) computes the
    reconstruction error.

    .. math::

        \tilde{x} \sim q_D(\tilde{x}|x)                                   (1)

        y = s(W \tilde{x} + b)                                            (2)

        z = s(W' y + b')                                                  (3)

        L(x, z) = -\sum_{k=1}^d [x_k \log z_k + (1 - x_k) \log(1 - z_k)]  (4)

    """
def __init__(
self,
numpy_rng,
theano_rng=None,
input=None,
n_visible=784,
n_hidden=500,
W=None,
bhid=None,
bvis=None
):
"""
Initialize the dA class by specifying the number of visible units (the
dimension d of the input), the number of hidden units (the dimension
d' of the latent or hidden space) and the corruption level. The
constructor also receives symbolic variables for the input, weights and
bias. Such symbolic variables are useful when, for example, the input
is the result of some computations, or when weights are shared between
the dA and an MLP layer. When dealing with SdAs this always happens:
the dA on layer 2 gets as input the output of the dA on layer 1,
and the weights of the dA are used in the second stage of training
to construct an MLP.

:type numpy_rng: numpy.random.RandomState
:param numpy_rng: numpy random number generator used to generate weights

:type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
:param theano_rng: Theano random generator; if None is given one is
                   generated based on a seed drawn from `numpy_rng`

:type input: theano.tensor.TensorType
:param input: a symbolic description of the input or None for
              standalone dA

:type n_visible: int
:param n_visible: number of visible units

:type n_hidden: int
:param n_hidden: number of hidden units

:type W: theano.tensor.TensorType
:param W: Theano variable pointing to a set of weights that should be
          shared between the dA and another architecture; if dA should
          be standalone set this to None

:type bhid: theano.tensor.TensorType
:param bhid: Theano variable pointing to a set of bias values (for
             hidden units) that should be shared between dA and another
             architecture; if dA should be standalone set this to None

:type bvis: theano.tensor.TensorType
:param bvis: Theano variable pointing to a set of bias values (for
             visible units) that should be shared between dA and another
             architecture; if dA should be standalone set this to None
"""
self.n_visible = n_visible
self.n_hidden = n_hidden
# create a Theano random generator that gives symbolic random values
if not theano_rng:
    theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))
# note : W' was written as `W_prime` and b' as `b_prime`
if not W:
    # W is initialized with `initial_W`, which is uniformly sampled
    # from -4*sqrt(6./(n_visible+n_hidden)) to
    # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
    # converted using asarray to dtype
    # theano.config.floatX so that the code is runnable on GPU
    initial_W = numpy.asarray(
        numpy_rng.uniform(
            low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
            high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
            size=(n_visible, n_hidden)
        ),
        dtype=theano.config.floatX
    )
    W = theano.shared(value=initial_W, name='W', borrow=True)
if not bvis:
    bvis = theano.shared(
        value=numpy.zeros(
            n_visible,
            dtype=theano.config.floatX
        ),
        borrow=True
    )
if not bhid:
    bhid = theano.shared(
        value=numpy.zeros(
            n_hidden,
            dtype=theano.config.floatX
        ),
        name='b',
        borrow=True
    )
self.W = W
# b corresponds to the bias of the hidden
self.b = bhid
# b_prime corresponds to the bias of the visible
self.b_prime = bvis
# tied weights, therefore W_prime is W transpose
self.W_prime = self.W.T
self.theano_rng = theano_rng
# if no input is given, generate a variable representing the input
if input is None:
    # we use a matrix because we expect a minibatch of several
    # examples, each example being a row
    self.x = T.dmatrix(name='input')
else:
    self.x = input
self.params = [self.W, self.b, self.b_prime]
def get_corrupted_input(self, input, corruption_level):
    """This function keeps ``1-corruption_level`` entries of the inputs the
    same and zeroes out a randomly selected subset of size ``corruption_level``.
    Note : the first argument of theano_rng.binomial is the shape (size) of
           the random numbers that it should produce,
           the second argument is the number of trials,
           the third argument is the probability of success of any trial.

           This will produce an array of 0s and 1s where 1 has a
           probability of 1 - ``corruption_level`` and 0 has a
           probability of ``corruption_level``.

           The binomial function returns the int64 data type by
           default. int64 multiplied by the input
           type (floatX) always returns float64. To keep all data
           in floatX when floatX is float32, we set the dtype of
           the binomial to floatX. As in our case the value of
           the binomial is always 0 or 1, this doesn't change the
           result. This is needed to allow the GPU to work
           correctly, as it only supports float32 for now.
    """
    return self.theano_rng.binomial(size=input.shape, n=1,
                                    p=1 - corruption_level,
                                    dtype=theano.config.floatX) * input
def get_hidden_values(self, input):
""" Computes the values of the hidden layer """
return T.nnet.sigmoid(T.dot(input, self.W) + self.b)
def get_reconstructed_input(self, hidden):
"""Computes the reconstructed input given the values of the
hidden layer
"""
return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)
def get_cost_updates(self, corruption_level, learning_rate):
""" This function computes the cost and the updates for one training
step of the dA """
tilde_x = self.get_corrupted_input(self.x, corruption_level)
y = self.get_hidden_values(tilde_x)
z = self.get_reconstructed_input(y)
[index],
cost,
updates=updates,
givens={
x: train_set_x[index * batch_size: (index + 1) * batch_size]
}
)
start_time = timeit.default_timer()
############
# TRAINING #
############
# go through training epochs
for epoch in xrange(training_epochs):
    # go through training set
    c = []
    for batch_index in xrange(n_train_batches):
        c.append(train_da(batch_index))
    print 'Training epoch %d, cost ' % epoch, numpy.mean(c)
end_time = timeit.default_timer()
training_time = (end_time - start_time)
print >> sys.stderr, ('The 30% corruption code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % (training_time / 60.))
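The masking corruption performed by get_corrupted_input can be mimicked in plain NumPy, which makes its effect easy to inspect. This is a sketch rather than the Theano code: the function name corrupt and the all-ones test input are hypothetical, and float32 is assumed for floatX.

```python
import numpy as np


def corrupt(x, corruption_level, rng):
    """Zero out each entry of x independently with probability corruption_level."""
    # one Bernoulli trial per entry: 1 keeps the value, 0 destroys it
    mask = rng.binomial(n=1, p=1 - corruption_level, size=x.shape)
    # cast the integer mask to the input dtype so the result stays float32
    return mask.astype(x.dtype) * x


rng = np.random.RandomState(123)
x = np.ones((4, 1000), dtype='float32')
x_tilde = corrupt(x, 0.3, rng)
fraction_zeroed = 1.0 - x_tilde.mean()   # close to 0.3 for a large input
```

With a corruption level of 0 the mask is all ones and the input passes through unchanged, which is how the uncorrupted run earlier in this chapter reuses the same class.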
In order to get a feeling of what the network learned we are going to plot the filters (defined by the weight
matrix). Bear in mind, however, that this does not provide the entire story, since we neglect the biases and
plot the weights up to a multiplicative constant (weights are converted to values between 0 and 1).
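The rescaling mentioned above can be illustrated with a short NumPy sketch. The helper below is a hypothetical stand-in written for this example, and the random matrix only imitates the shape of the learned weights (one column per hidden unit).

```python
import numpy as np


def scale_to_unit_interval(values, eps=1e-8):
    """Shift and rescale an array so that its values lie in [0, 1]."""
    values = values - values.min()
    return values / (values.max() + eps)


rng = np.random.RandomState(0)
# stand-in for the learned weight matrix of shape (n_visible, n_hidden)
W = rng.uniform(-0.3, 0.3, size=(28 * 28, 500))

# each column of W is the filter of one hidden unit; view it as a 28x28 image
first_filter = scale_to_unit_interval(W[:, 0]).reshape(28, 28)
pixels = (first_filter * 255).astype('uint8')   # grey levels for display
```

Only the relative values within a filter survive this rescaling, which is why the absolute magnitude of the weights cannot be read off the plotted images.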
To plot our filters we will need the help of tile_raster_images (see Plotting Samples and Filters), so
we urge the reader to study it. With the help of the Python Imaging Library, the following lines of code
will save the filters as an image:
image = Image.fromarray(tile_raster_images(
X=da.W.get_value(borrow=True).T,
img_shape=(28, 28), tile_shape=(10, 10),
tile_spacing=(1, 1)))
image.save('filters_corruption_30.png')
CHAPTER
EIGHT
Note: This section assumes you have already read through Classifying MNIST digits using Logistic Regression and Multilayer Perceptron. Additionally, it uses the following Theano functions and concepts: T.tanh,
shared variables, basic arithmetic ops, T.grad, Random numbers, floatX. If you intend to run the code on
GPU, also read GPU.
Note: The code for this section is available for download here.
The Stacked Denoising Autoencoder (SdA) is an extension of the stacked autoencoder [Bengio07] and it
was introduced in [Vincent08].
This tutorial builds on the previous tutorial Denoising Autoencoders. Especially if you do not have experience with autoencoders, we recommend reading it before going any further.
the autoencoders and the sigmoid layers of the MLP share parameters, and
the latent representations computed by intermediate layers of the MLP are fed as input to the autoencoders.
class SdA(object):
"""Stacked denoising auto-encoder class (SdA)
A stacked denoising autoencoder model is obtained by stacking several
dAs. The hidden layer of the dA at layer i becomes the input of
the dA at layer i+1. The first layer dA gets as input the input of
the SdA, and the hidden layer of the last dA represents the output.
Note that after pretraining, the SdA is dealt with as a normal MLP,
the dAs are only used to initialize the weights.
"""
def __init__(
self,
numpy_rng,
theano_rng=None,
n_ins=784,
hidden_layers_sizes=[500, 500],
n_outs=10,
corruption_levels=[0.1, 0.1]
):
""" This class is made to support a variable number of layers.
:type numpy_rng: numpy.random.RandomState
:param numpy_rng: numpy random number generator used to draw initial
weights
:type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
:param theano_rng: Theano random generator; if None is given one is
generated based on a seed drawn from rng
:type n_ins: int
:param n_ins: dimension of the input to the sdA
:type hidden_layers_sizes: list of ints
:param hidden_layers_sizes: intermediate layers size, must contain
at least one value
:type n_outs: int
:param n_outs: dimension of the output of the network
:type corruption_levels: list of float
:param corruption_levels: amount of corruption to use for each
layer
"""
self.sigmoid_layers = []
self.dA_layers = []
self.params = []
self.n_layers = len(hidden_layers_sizes)
self.sigmoid_layers will store the sigmoid layers of the MLP facade, while self.dA_layers
will store the denoising autoencoder associated with the layers of the MLP.
Next, we construct n_layers sigmoid layers and n_layers denoising autoencoders, where n_layers
is the depth of our model. We use the HiddenLayer class introduced in Multilayer Perceptron, with one
modification: we replace the tanh non-linearity with the logistic function $s(x) = \frac{1}{1+e^{-x}}$. We link the
sigmoid layers to form an MLP, and construct the denoising autoencoders such that each shares the weight
matrix and the bias of its encoding part with its corresponding sigmoid layer.
for i in xrange(self.n_layers):
    # construct the sigmoidal layer
    # the size of the input is either the number of hidden units of
    # the layer below or the input size if we are on the first layer
    if i == 0:
        input_size = n_ins
    else:
        input_size = hidden_layers_sizes[i - 1]
    # the input to this layer is either the activation of the hidden
    # layer below or the input of the SdA if you are on the first
    # layer
    if i == 0:
        layer_input = self.x
    else:
        layer_input = self.sigmoid_layers[-1].output
    sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                input=layer_input,
                                n_in=input_size,
                                n_out=hidden_layers_sizes[i],
                                activation=T.nnet.sigmoid)
    # add the layer to our list of layers
    self.sigmoid_layers.append(sigmoid_layer)
    # it's arguably a philosophical question...
    # but we are going to only declare that the parameters of the
    # sigmoid_layers are parameters of the SdA;
    # the visible biases in the dA are parameters of those
    # dAs, but not of the SdA
    self.params.extend(sigmoid_layer.params)
    # Construct a denoising autoencoder that shares weights with this
    # layer
    dA_layer = dA(numpy_rng=numpy_rng,
                  theano_rng=theano_rng,
                  input=layer_input,
                  n_visible=input_size,
                  n_hidden=hidden_layers_sizes[i],
                  W=sigmoid_layer.W,
                  bhid=sigmoid_layer.b)
    self.dA_layers.append(dA_layer)
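The size bookkeeping of the loop above can be isolated in a few lines of plain Python. The function layer_shapes is a hypothetical helper written for this illustration; it only computes the (n_in, n_out) pair that each sigmoid layer and its companion dA share.

```python
def layer_shapes(n_ins, hidden_layers_sizes):
    """Return the (input size, output size) pair of every stacked layer."""
    shapes = []
    for i, n_out in enumerate(hidden_layers_sizes):
        # the first layer reads the raw input, later layers read the
        # hidden units of the layer below
        n_in = n_ins if i == 0 else hidden_layers_sizes[i - 1]
        shapes.append((n_in, n_out))
    return shapes


# default constructor arguments of the SdA class
shapes = layer_shapes(784, [500, 500])  # -> [(784, 500), (500, 500)]
```

Each pair is also the shape of the weight matrix W that the sigmoid layer and its denoising autoencoder hold in common.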
All we need now is to add a logistic layer on top of the sigmoid layers such that we have an MLP. We will
use the LogisticRegression class introduced in Classifying MNIST digits using Logistic Regression.
# We now need to add a logistic layer on top of the MLP
self.logLayer = LogisticRegression(
input=self.sigmoid_layers[-1].output,
n_in=hidden_layers_sizes[-1],
n_out=n_outs
)
self.params.extend(self.logLayer.params)
# construct a function that implements one step of finetuning
# compute the cost for second phase of training,
# defined as the negative log likelihood
self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)
# compute the gradients with respect to the model parameters
# symbolic variable that points to the number of errors made on the
# minibatch given by self.x and self.y
self.errors = self.logLayer.errors(self.y)
The SdA class also provides a method that generates training functions for the denoising autoencoders in its
layers. They are returned as a list, where element i is a function that implements one step of training the dA
corresponding to layer i.
def pretraining_functions(self, train_set_x, batch_size):
    ''' Generates a list of functions, each of them implementing one
    step in training the dA corresponding to the layer with same index.
    The function will require as input the minibatch index, and to train
    a dA you just need to iterate, calling the corresponding function on
    all minibatch indexes.

    :type train_set_x: theano.tensor.TensorType
    :param train_set_x: Shared variable that contains all datapoints used
                        for training the dA

    :type batch_size: int
    :param batch_size: size of a [mini]batch

    :type learning_rate: float
    :param learning_rate: learning rate used during training for any of
                          the dA layers
    '''
    # index to a [mini]batch
    index = T.lscalar('index')
To be able to change the corruption level or the learning rate during training, we associate Theano variables
with them.
        corruption_level = T.scalar('corruption')  # % of corruption to use
        learning_rate = T.scalar('lr')  # learning rate to use
        # beginning of a batch, given index
        batch_begin = index * batch_size
        # ending of a batch given index
        batch_end = batch_begin + batch_size

        pretrain_fns = []
        for dA in self.dA_layers:
            # get the cost and the updates list
            cost, updates = dA.get_cost_updates(corruption_level,
                                                learning_rate)
            # compile the theano function
            fn = theano.function(
                inputs=[
                    index,
                    theano.Param(corruption_level, default=0.2),
                    theano.Param(learning_rate, default=0.1)
                ],
                outputs=cost,
                updates=updates,
                givens={
                    self.x: train_set_x[batch_begin: batch_end]
                }
            )
            # append fn to the list of functions
            pretrain_fns.append(fn)

        return pretrain_fns
Now any function pretrain_fns[i] takes as arguments index and optionally corruption (the
corruption level) or lr (the learning rate). Note that the names of the parameters are the names given to the
Theano variables when they are constructed, not the names of the Python variables (learning_rate or
corruption_level). Keep this in mind when working with Theano.
In the same fashion we build a method for constructing the functions required during finetuning
(train_fn, valid_score and test_score).
    def build_finetune_functions(self, datasets, batch_size, learning_rate):
        '''Generates a function train that implements one step of
        finetuning, a function validate that computes the error on
        a batch from the validation set, and a function test that
        computes the error on a batch from the testing set

        :type datasets: list of pairs of theano.tensor.TensorType
        :param datasets: A list that contains all the datasets;
                         it has to contain three pairs, train,
                         valid, test in this order, where each pair ...
        '''

        # index to a [mini]batch
                self.y: test_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='test'
        )

        valid_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: valid_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: valid_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='valid'
        )

        # Create a function that scans the entire validation set
        def valid_score():
            return [valid_score_i(i) for i in xrange(n_valid_batches)]

        # Create a function that scans the entire test set
        def test_score():
            return [test_score_i(i) for i in xrange(n_test_batches)]

        return train_fn, valid_score, test_score
Note that valid_score and test_score are not Theano functions, but rather Python functions that
loop over the entire validation set and the entire test set, respectively, producing a list of the losses over
these sets.
There are two stages of training for this network: layer-wise pre-training followed by fine-tuning.
For the pre-training stage, we will loop over all the layers of the network. For each layer we will use the
compiled Theano function that implements an SGD step towards optimizing the weights for reducing the
reconstruction cost of that layer. This function will be applied to the training set for a fixed number of
epochs given by pretraining_epochs.
    #########################
    # PRETRAINING THE MODEL #
    #########################
    print '... getting the pretraining functions'
    pretraining_fns = sda.pretraining_functions(train_set_x=train_set_x,
                                                batch_size=batch_size)

    print '... pre-training the model'
    start_time = timeit.default_timer()
    ## Pre-train layer-wise
    corruption_levels = [.1, .2, .3]
    for i in xrange(sda.n_layers):
        # go through pretraining epochs
        for epoch in xrange(pretraining_epochs):
            # go through the training set
            c = []
            for batch_index in xrange(n_train_batches):
                c.append(pretraining_fns[i](index=batch_index,
                                            corruption=corruption_levels[i],
                                            lr=pretrain_lr))
            print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch),
            print numpy.mean(c)

    end_time = timeit.default_timer()

    print >> sys.stderr, ('The pretraining code for file ' +
                          os.path.split(__file__)[1] +
                          ' ran for %.2fm' % ((end_time - start_time) / 60.))
The fine-tuning loop is very similar to the one in the Multilayer Perceptron. The only difference is that it
uses the functions given by build_finetune_functions.
By default the code runs 15 pre-training epochs for each layer, with a batch size of 1. The corruption levels
are 0.1 for the first layer, 0.2 for the second, and 0.3 for the third. The pretraining learning rate is 0.001
and the finetuning learning rate is 0.1. Pre-training takes 585.01 minutes, with an average of 13 minutes per
epoch. Fine-tuning is completed after 36 epochs in 444.2 minutes, with an average of 12.34 minutes per
epoch. The final validation score is 1.39% with a testing score of 1.3%. These results were obtained on a
machine with an Intel Xeon E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.
CHAPTER
NINE
Note: This section assumes the reader has already read through Classifying MNIST digits using Logistic
Regression and Multilayer Perceptron. Additionally it uses the following Theano functions and concepts :
T.tanh, shared variables, basic arithmetic ops, T.grad, Random numbers, floatX and scan. If you intend to
run the code on GPU also read GPU.
Note: The code for this section is available for download here.
$$p(x) = \frac{e^{-E(x)}}{Z}. \tag{9.1}$$
The normalizing factor Z is called the partition function by analogy with physical systems.
$$Z = \sum_x e^{-E(x)}$$
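For a model small enough to enumerate, the partition function can be computed by brute force. The sketch below is only an illustration: the toy energy E(x) = -theta . x and the values of theta are assumptions chosen for this example, not part of the tutorial code.

```python
import itertools

import numpy as np


def energy(x, theta):
    # hypothetical toy energy function: E(x) = -theta . x
    return -np.dot(theta, x)


theta = np.array([0.5, -1.0, 2.0])

# enumerate every binary configuration of a 3-dimensional input
states = [np.array(s, dtype=float)
          for s in itertools.product([0, 1], repeat=3)]

# unnormalized probabilities e^{-E(x)}, their sum Z, and p(x) as in Eq. (9.1)
unnormalized = np.array([np.exp(-energy(x, theta)) for x in states])
Z = unnormalized.sum()
p = unnormalized / Z
```

The number of configurations doubles with every added input dimension, which is why this direct computation is not an option for realistic models.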
An energy-based model can be learnt by performing (stochastic) gradient descent on the empirical negative
log-likelihood of the training data. As for the logistic regression we will first define the log-likelihood and
then the loss function as being the negative log-likelihood.
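$$\mathcal{L}(\theta, \mathcal{D}) = \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \log p(x^{(i)})$$
$$\ell(\theta, \mathcal{D}) = -\mathcal{L}(\theta, \mathcal{D})$$
using the stochastic gradient $-\frac{\partial \log p(x^{(i)})}{\partial \theta}$, where $\theta$ are the parameters of the model.
$$P(x) = \sum_h P(x, h) = \sum_h \frac{e^{-E(x, h)}}{Z}. \tag{9.2}$$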
In such cases, to map this formulation to one similar to Eq. (9.1), we introduce the notation (inspired from
physics) of free energy, defined as follows:
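$$\mathcal{F}(x) = -\log \sum_h e^{-E(x, h)} \tag{9.3}$$
so that
$$P(x) = \frac{e^{-\mathcal{F}(x)}}{Z} \quad \text{with} \quad Z = \sum_x e^{-\mathcal{F}(x)}.$$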
The data negative log-likelihood gradient then has a particularly interesting form.
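$$-\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial \mathcal{F}(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}. \tag{9.4}$$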
Notice that the above gradient contains two terms, which are referred to as the positive and negative phase.
The terms positive and negative do not refer to the sign of each term in the equation, but rather reflect their
effect on the probability density defined by the model. The first term increases the probability of training
data (by reducing the corresponding free energy), while the second term decreases the probability of samples
generated by the model.
It is usually difficult to determine this gradient analytically, as it involves the computation of $E_P[\frac{\partial \mathcal{F}(x)}{\partial \theta}]$.
This is nothing less than an expectation over all possible configurations of the input x (under the distribution
P formed by the model)!
The first step in making this computation tractable is to estimate the expectation using a fixed number of
model samples. Samples used to estimate the negative phase gradient are referred to as negative particles,
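which are denoted as $\mathcal{N}$. The gradient can then be written as:
$$-\frac{\partial \log p(x)}{\partial \theta} \approx \frac{\partial \mathcal{F}(x)}{\partial \theta} - \frac{1}{|\mathcal{N}|} \sum_{\tilde{x} \in \mathcal{N}} \frac{\partial \mathcal{F}(\tilde{x})}{\partial \theta}. \tag{9.5}$$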
$$E(v, h) = -b'v - c'h - h'Wv \tag{9.6}$$
where W represents the weights connecting hidden and visible units and b, c are the offsets of the visible
and hidden layers respectively.
This translates directly to the following free energy formula:
$$\mathcal{F}(v) = -b'v - \sum_i \log \sum_{h_i} e^{h_i (c_i + W_i v)}.$$
Because of the specific structure of RBMs, visible and hidden units are conditionally independent given
one-another. Using this property, we can write:
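$$p(h|v) = \prod_i p(h_i|v)$$
$$p(v|h) = \prod_j p(v_j|h).$$
For binary units ($v_j, h_i \in \{0, 1\}$), each conditional is a sigmoid of the total input to the unit:
$$P(h_i = 1|v) = \mathrm{sigm}(c_i + W_i v) \tag{9.7}$$
$$P(v_j = 1|h) = \mathrm{sigm}(b_j + W'_j h) \tag{9.8}$$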
The free energy of an RBM with binary units further simplifies to:
$$\mathcal{F}(v) = -b'v - \sum_i \log(1 + e^{(c_i + W_i v)}). \tag{9.9}$$
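The closed form in Eq. (9.9) can be checked against the definition of the free energy in Eq. (9.3) on a model small enough to enumerate all hidden configurations. This is a minimal NumPy sketch, not the tutorial's Theano code; the sizes, the random parameters and the helper names are assumptions for illustration. W has shape (n_hidden, n_visible) here, matching the equations rather than the RBM class.

```python
import itertools

import numpy as np


def energy(v, h, W, b, c):
    # E(v, h) = -b'v - c'h - h'Wv, with W of shape (n_hidden, n_visible)
    return -b.dot(v) - c.dot(h) - h.dot(W).dot(v)


def free_energy_bruteforce(v, W, b, c):
    # F(v) = -log sum_h exp(-E(v, h)), enumerating every binary h
    n_hidden = W.shape[0]
    terms = [np.exp(-energy(v, np.array(h, dtype=float), W, b, c))
             for h in itertools.product([0, 1], repeat=n_hidden)]
    return -np.log(np.sum(terms))


def free_energy_closed_form(v, W, b, c):
    # F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v)), as in Eq. (9.9)
    return -b.dot(v) - np.sum(np.log1p(np.exp(c + W.dot(v))))


rng = np.random.RandomState(0)
n_visible, n_hidden = 4, 3
W = rng.normal(scale=0.5, size=(n_hidden, n_visible))
b = rng.normal(size=n_visible)
c = rng.normal(size=n_hidden)
v = np.array([1., 0., 1., 1.])

f_brute = free_energy_bruteforce(v, W, b, c)
f_closed = free_energy_closed_form(v, W, b, c)
```

The two values agree up to floating-point error because the sum over h factorizes into one two-term sum per hidden unit.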
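Combining Eq. (9.4) with Eq. (9.9) gives the following log-likelihood gradients for an RBM with binary units,
where the expectations are taken over samples $\tilde{v}$ from the model:
$$-\frac{\partial \log p(v)}{\partial W_{ij}} = E_{\tilde{v}}[p(h_i = 1|\tilde{v}) \cdot \tilde{v}_j] - v_j \cdot \mathrm{sigm}(W_i v + c_i)$$
$$-\frac{\partial \log p(v)}{\partial c_i} = E_{\tilde{v}}[p(h_i = 1|\tilde{v})] - \mathrm{sigm}(W_i v + c_i)$$
$$-\frac{\partial \log p(v)}{\partial b_j} = E_{\tilde{v}}[\tilde{v}_j] - v_j \tag{9.10}$$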
For a more detailed derivation of these equations, we refer the reader to the following page, or to section 5
of Learning Deep Architectures for AI. We will however not use these formulas, but rather get the gradient
using Theano T.grad from equation (9.4).
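To make the role of the free-energy gradient concrete, here is a small NumPy sketch of the estimate in Eq. (9.5) for the weight matrix. It is not the tutorial's implementation (which relies on T.grad): the helper names are hypothetical, W has shape (n_hidden, n_visible) as in the equations, and the "negative particles" are random binary vectors standing in for real model samples.

```python
import numpy as np


def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))


def free_energy(v, W, b, c):
    # Eq. (9.9), with W of shape (n_hidden, n_visible)
    return -b.dot(v) - np.sum(np.logaddexp(0.0, c + W.dot(v)))


def grad_free_energy_W(v, W, c):
    # dF/dW_ij = -sigm(c_i + W_i v) * v_j
    return -np.outer(sigm(c + W.dot(v)), v)


def nll_grad_W(v_data, negative_particles, W, c):
    # Eq. (9.5): positive phase minus the average over negative particles
    negative = np.mean(
        [grad_free_energy_W(v, W, c) for v in negative_particles], axis=0)
    return grad_free_energy_W(v_data, W, c) - negative


rng = np.random.RandomState(0)
W = rng.normal(scale=0.5, size=(3, 4))
b = rng.normal(size=4)
c = rng.normal(size=3)
v = np.array([1., 0., 1., 0.])

# finite-difference check of dF/dW[1, 2]
eps = 1e-6
W_plus = W.copy()
W_plus[1, 2] += eps
W_minus = W.copy()
W_minus[1, 2] -= eps
numeric = (free_energy(v, W_plus, b, c)
           - free_energy(v, W_minus, b, c)) / (2 * eps)
analytic = grad_free_energy_W(v, W, c)[1, 2]

# placeholder negative particles (assumption: NOT samples from the model)
particles = (rng.uniform(size=(10, 4)) < 0.5).astype(float)
g = nll_grad_W(v, particles, W, c)
```

In practice the particles would come from a Gibbs chain, and automatic differentiation removes the need to write grad_free_energy_W by hand.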
where $h^{(n)}$ refers to the set of all hidden units at the n-th step of the Markov chain. What it means is that, for
example, $h_i^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\mathrm{sigm}(W_i' v^{(n)} + c_i)$, and similarly,
$v_j^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\mathrm{sigm}(W_{\cdot j} h^{(n+1)} + b_j)$.
This can be illustrated graphically:
[Figure: Gibbs chain alternating between sampling the hidden units and the visible units.]
9.3.2 Persistent CD
Persistent CD [Tieleman08] uses another approximation for sampling from p(v, h). It relies on a single
Markov chain, which has a persistent state (i.e., not restarting a chain for each observed example). For each
parameter update, we extract new samples by simply running the chain for k-steps. The state of the chain is
then preserved for subsequent updates.
The general intuition is that if parameter updates are small enough compared to the mixing rate of the chain,
the Markov chain should be able to catch up to changes in the model.
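The difference between restarting the chain at the data and keeping a persistent chain can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the tutorial's Theano code: the helper names, the sizes and the random toy data are made up, no parameter update is performed, and W has shape (n_visible, n_hidden) as in the RBM class.

```python
import numpy as np


def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))


def gibbs_hvh(h0, W, b, c, rng):
    # one Gibbs step h -> v -> h; W has shape (n_visible, n_hidden)
    v1_mean = sigm(h0.dot(W.T) + b)
    v1 = (rng.uniform(size=v1_mean.shape) < v1_mean).astype(float)
    h1_mean = sigm(v1.dot(W) + c)
    h1 = (rng.uniform(size=h1_mean.shape) < h1_mean).astype(float)
    return v1, h1


def negative_particles(chain_start, W, b, c, rng, k=1):
    # run the chain for k steps from the given hidden state
    h = chain_start
    for _ in range(k):
        v, h = gibbs_hvh(h, W, b, c, rng)
    return v, h


rng = np.random.RandomState(123)
n_visible, n_hidden, batch = 6, 4, 5
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
data = (rng.uniform(size=(batch, n_visible)) < 0.5).astype(float)

# CD-k: the chain restarts from a hidden sample driven by the data
ph_mean = sigm(data.dot(W) + c)
ph_sample = (rng.uniform(size=ph_mean.shape) < ph_mean).astype(float)
v_cd, h_cd = negative_particles(ph_sample, W, b, c, rng, k=1)

# PCD-k: the chain state is kept from one parameter update to the next
persistent = np.zeros((batch, n_hidden))
for update in range(3):
    v_pcd, persistent = negative_particles(persistent, W, b, c, rng, k=1)
    # a parameter update using v_pcd as negative particles would go here
```

The only difference between the two variants is where the chain starts; everything downstream of the negative particles is shared.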
9.4 Implementation
We construct an RBM class. The parameters of the network can either be initialized by the constructor or can
be passed as arguments. This option is useful when an RBM is used as the building block of a deep network,
in which case the weight matrix and the hidden layer bias are shared with the corresponding sigmoidal layer
of an MLP network.
class RBM(object):
    """Restricted Boltzmann Machine (RBM) """
    def __init__(
        self,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        hbias=None,
        vbias=None,
        numpy_rng=None,
        theano_rng=None
    ):
        """
        RBM constructor. Defines the parameters of the model along with
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='hbias',
                borrow=True
            )

        if vbias is None:
            # create shared variable for visible units bias
            vbias = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                name='vbias',
                borrow=True
            )

        # initialize input layer for standalone RBM or layer0 of DBN
        self.input = input
        if input is None:
            self.input = T.matrix('input')

        self.W = W
        self.hbias = hbias
        self.vbias = vbias
        self.theano_rng = theano_rng
        # **** WARNING: It is not a good idea to put things in this list
        # other than shared variables created in this function.
        self.params = [self.W, self.hbias, self.vbias]
The next step is to define functions that construct the symbolic graph associated with Eqs. (9.7) - (9.8). The
code is as follows:
    def propup(self, vis):
        '''This function propagates the visible units activation upwards to
        the hidden units

        Note that we return also the pre-sigmoid activation of the
        layer. As it will turn out later, due to how Theano deals with
        optimizations, this symbolic variable will be needed to write
        down a more stable computational graph (see details in the
        reconstruction cost function)

        '''
        pre_sigmoid_activation = T.dot(vis, self.W) + self.hbias
        return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)]
We can then use these functions to define the symbolic graph for a Gibbs sampling step. We define two
functions:
- gibbs_vhv, which performs a step of Gibbs sampling starting from the visible units. As we shall
  see, this will be useful for sampling from the RBM.
- gibbs_hvh, which performs a step of Gibbs sampling starting from the hidden units. This function
  will be useful for performing CD and PCD updates.
The code is as follows:
    def gibbs_hvh(self, h0_sample):
        ''' This function implements one step of Gibbs sampling,
            starting from the hidden state'''
        pre_sigmoid_v1, v1_mean, v1_sample = self.sample_v_given_h(h0_sample)
        pre_sigmoid_h1, h1_mean, h1_sample = self.sample_h_given_v(v1_sample)
        return [pre_sigmoid_v1, v1_mean, v1_sample,
                pre_sigmoid_h1, h1_mean, h1_sample]
Note that we also return the pre-sigmoid activation. To understand why, you need to know a bit about how
Theano works. Whenever you compile a Theano function, the computational graph that you pass as input
gets optimized for speed and stability. This is done by replacing parts of the graph with equivalent
subgraphs. One such optimization expresses terms of the form log(sigmoid(x)) in terms of softplus. We
need this optimization for the cross-entropy, since the sigmoid of numbers larger than 30 (or even less than
that) evaluates to 1 and the sigmoid of numbers smaller than -30 evaluates to 0, which in turn forces Theano
to compute log(0), so that we get either -inf or NaN as the cost. If the value is expressed in terms of softplus
we do not get this undesirable behaviour. This optimization usually works fine, but here we have a special
case. The sigmoid is applied inside the scan op, while the log is outside. Therefore Theano will only see
log(scan(..)) instead of log(sigmoid(..)) and will not apply the wanted optimization. We also cannot replace
the sigmoid inside scan with something else, because this only needs to be done on the last step. The
easiest and most efficient way is therefore to also return the pre-sigmoid activation as an output of scan, and
to apply both the log and the sigmoid outside scan so that Theano can catch and optimize the expression.
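The numerical issue can be reproduced outside Theano. The NumPy sketch below compares the naive log(sigmoid(x)) with the equivalent softplus form, using the identity log(sigmoid(x)) = -softplus(-x). The input -800 is an assumption chosen so that the sigmoid underflows in float64; with narrower float types the saturation happens at much smaller magnitudes.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softplus(x):
    # log(1 + exp(x)), computed without overflow
    return np.logaddexp(0.0, x)


x = np.array([-800.0, -30.0, 0.0, 30.0])

# naive form: the sigmoid underflows to 0 for x = -800, so the log is -inf
with np.errstate(over='ignore', divide='ignore'):
    naive = np.log(sigmoid(x))

# stable form: log(sigmoid(x)) = -softplus(-x)
stable = -softplus(-x)
```

Working from the pre-sigmoid activation is what makes the stable form available, which is why the Gibbs step functions return it.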
The class also has a function that computes the free energy of the model, needed for computing the gradient
of the parameters (see Eq. (9.4)).
    def free_energy(self, v_sample):
        ''' Function to compute the free energy '''
        wx_b = T.dot(v_sample, self.W) + self.hbias
        vbias_term = T.dot(v_sample, self.vbias)
        hidden_term = T.sum(T.log(1 + T.exp(wx_b)), axis=1)
        return -hidden_term - vbias_term
We then add a get_cost_updates method, whose purpose is to generate the symbolic gradients for
CD-k and PCD-k updates.
    def get_cost_updates(self, lr=0.1, persistent=None, k=1):
        """This function implements one step of CD-k or PCD-k

        :param lr: learning rate used to train the RBM

        :param persistent: None for CD. For PCD, shared variable
            containing old state of Gibbs chain. This must be a shared
            variable of size (batch size, number of hidden units).

        :param k: number of Gibbs steps to do in CD-k/PCD-k

        Returns a proxy for the cost and the updates dictionary. The
        dictionary contains the update rules for weights and biases but
        also an update of the shared variable used to store the persistent
        chain, if one is used.

        """

        # compute positive phase
        pre_sigmoid_ph, ph_mean, ph_sample = self.sample_h_given_v(self.input)

        # decide how to initialize persistent chain:
        # for CD, we use the newly generated hidden sample
        # for PCD, we initialize from the old state of the chain
        if persistent is None:
            chain_start = ph_sample
        else:
            chain_start = persistent
Note that get_cost_updates takes as argument a variable called persistent. This allows us to use
the same code to implement both CD and PCD. To use PCD, persistent should refer to a shared variable
which contains the state of the Gibbs chain from the previous iteration.
If persistent is None, we initialize the Gibbs chain with the hidden sample generated during the positive
phase, therefore implementing CD. Once we have established the starting point of the chain, we can then
compute the sample at the end of the Gibbs chain, sample that we need for getting the gradient (see Eq.
(9.4)). To do so, we will use the scan op provided by Theano, therefore we urge the reader to look it up by
following this link.
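        # perform the negative phase: to implement CD-k/PCD-k we scan
        # over the function that implements one Gibbs step k times;
        # scan returns the entire Gibbs chain
        (
            [
                pre_sigmoid_nvs,
                nv_means,
                nv_samples,
                pre_sigmoid_nhs,
                nh_means,
                nh_samples
            ],
            updates
        ) = theano.scan(
            self.gibbs_hvh,
            # the None are place holders, saying that
            # chain_start is the initial state corresponding to the
            # 6th output
            outputs_info=[None, None, None, None, None, chain_start],
            n_steps=k
        )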
Once we have generated the chain, we take the sample at the end of the chain to get the free energy of the
negative phase. Note that chain_end is a symbolic Theano variable expressed in terms of the model
parameters, and if we applied T.grad naively, the function would try to go through the Gibbs chain to
get the gradients. This is not what we want (it would mess up our gradients), and therefore we need to indicate
to T.grad that chain_end is a constant, which can be done with its consider_constant argument.
Finally, we extend the updates dictionary returned by scan (which contains update rules for the random states
of theano_rng) with the parameter updates. In the case of PCD, these should also update the shared
variable containing the state of the Gibbs chain.
        # constructs the update dictionary
        for gparam, param in zip(gparams, self.params):
            # make sure that the learning rate is of the right dtype
            updates[param] = param - gparam * T.cast(
                lr,
                dtype=theano.config.floatX
            )
        if persistent:
            # Note that this works only if persistent is a shared variable
            updates[persistent] = nh_samples[-1]
            # pseudo-likelihood is a better proxy for PCD
            monitoring_cost = self.get_pseudo_likelihood_cost(updates)
        else:
            # reconstruction cross-entropy is a better proxy for CD
            monitoring_cost = self.get_reconstruction_cost(updates,
                                                           pre_sigmoid_nvs[-1])

        return monitoring_cost, updates
gray-scale image (after reshaping to a square matrix). Filters should pick out strong features in the data.
While it is not clear, for an arbitrary dataset, what these features should look like, training on MNIST usually
results in filters which act as stroke detectors, while training on natural images leads to Gabor-like filters if
trained in conjunction with a sparsity criterion.
Proxies to Likelihood
Other, more tractable functions can be used as a proxy to the likelihood. When training an RBM with PCD,
one can use pseudo-likelihood as the proxy. Pseudo-likelihood (PL) is much less expensive to compute, as
it assumes that all bits are independent. Therefore,
$$PL(x) = \prod_i P(x_i | x_{-i}) \quad \text{and}$$
$$\log PL(x) = \sum_i \log P(x_i | x_{-i}).$$
Here $x_{-i}$ denotes the set of all bits of x except bit i. The log-PL is therefore the sum of the log-probabilities
of each bit $x_i$, conditioned on the state of all other bits. For MNIST, this would involve summing over the
784 input dimensions, which remains rather expensive. For this reason, we use the following stochastic
approximation to log-PL:
$$g = N \cdot \log P(x_i | x_{-i}), \text{ where } i \sim U(0, N), \text{ and}$$
$$E[g] = \log PL(x)$$
where the expectation is taken over the uniform random choice of index i, and N is the number of visible
units. In order to work with binary units, we further introduce the notation $\tilde{x}_i$ to refer to x with bit i being
flipped (1->0, 0->1). The log-PL for an RBM with binary units is then written as:
$$\log PL(x) \approx N \cdot \log \frac{e^{-FE(x)}}{e^{-FE(x)} + e^{-FE(\tilde{x}_i)}} = N \cdot \log[\mathrm{sigm}(FE(\tilde{x}_i) - FE(x))]$$
We therefore return this cost as well as the RBM updates in the get_cost_updates function of the RBM
class. Notice that we modify the updates dictionary to increment the index of bit i. This will result in bit i
cycling over all possible values {0, 1, ..., N - 1}, from one update to another.
Note that for CD training the cross-entropy cost between the input and the reconstruction (the same as the
one used for the denoising autoencoder) is more reliable than the pseudo-log-likelihood. Here is the code
we use to compute the pseudo-likelihood:
    def get_pseudo_likelihood_cost(self, updates):
        """Stochastic approximation to the pseudo-likelihood"""

        # index of bit i in expression p(x_i | x_{\i})
        bit_i_idx = theano.shared(value=0, name='bit_i_idx')

        # binarize the input image by rounding to nearest integer
        xi = T.round(self.input)
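The stochastic estimate of the log-PL can be checked numerically on a small RBM: averaging the single-bit estimate g over all choices of i recovers the exact log-PL. The following is a NumPy sketch under assumptions (random parameters, a hand-picked binary input, hypothetical helper names), not the Theano method above. W has shape (n_visible, n_hidden) as in the RBM class.

```python
import numpy as np


def free_energy(v, W, b, c):
    # Eq. (9.9); W has shape (n_visible, n_hidden)
    return -v.dot(b) - np.sum(np.logaddexp(0.0, v.dot(W) + c))


def log_pl_term(x, i, W, b, c):
    # log P(x_i | x_{-i}) = log sigm(FE(x with bit i flipped) - FE(x))
    x_flip = x.copy()
    x_flip[i] = 1.0 - x_flip[i]
    delta = free_energy(x_flip, W, b, c) - free_energy(x, W, b, c)
    return -np.logaddexp(0.0, -delta)  # stable log(sigmoid(delta))


rng = np.random.RandomState(1)
n_visible, n_hidden = 5, 3
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))
b = rng.normal(size=n_visible)
c = rng.normal(size=n_hidden)
x = np.array([1., 0., 0., 1., 1.])

# exact log-PL: sum over all bits
exact_log_pl = sum(log_pl_term(x, i, W, b, c) for i in range(n_visible))

# stochastic estimate g = N * log P(x_i | x_{-i}), one value per index i
estimates = np.array([n_visible * log_pl_term(x, i, W, b, c)
                      for i in range(n_visible)])
```

Each call needs only two free-energy evaluations, compared with N + 1 for the exact sum, which is the saving the approximation is after.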
Once the RBM is trained, we can then use the gibbs_vhv function to implement the Gibbs chain required
for sampling. We initialize the Gibbs chain starting from test examples (although we could as well pick it
from the training set) in order to speed up convergence and avoid problems with random initialization. We
again use Theano's scan op to do 1000 steps before each plotting.
    #################################
    #     Sampling from the RBM     #
    #################################
    # find out the number of test samples
    number_of_test_samples = test_set_x.get_value(borrow=True).shape[0]

    # pick random test examples, with which to initialize the persistent chain
    test_idx = rng.randint(number_of_test_samples - n_chains)
    persistent_vis_chain = theano.shared(
        numpy.asarray(
            test_set_x.get_value(borrow=True)[test_idx:test_idx + n_chains],
            dtype=theano.config.floatX
        )
    )
Next we create the 20 persistent chains in parallel to get our samples. To do so, we compile a Theano function
which performs one Gibbs step and updates the state of the persistent chain with the new visible sample. We
apply this function iteratively for a large number of steps, plotting the samples every 1000 steps.
    plot_every = 1000
    # define one step of Gibbs sampling (mf = mean-field) define a
    # function that does plot_every steps before returning the
    # sample for plotting
    (
        [
            presig_hids,
            hid_mfs,
            hid_samples,
            presig_vis,
            vis_mfs,
            vis_samples
        ],
        updates
    ) = theano.scan(
        rbm.gibbs_vhv,
        outputs_info=[None, None, None, None, None, persistent_vis_chain],
        n_steps=plot_every
    )

    # add to updates the shared variable that takes care of our persistent
    # chain.
    updates.update({persistent_vis_chain: vis_samples[-1]})
    # construct the function that implements our persistent chain.
    # we generate the "mean field" activations for plotting and the actual
    # samples for reinitializing the state of our persistent chain
    sample_fn = theano.function(
        [],
        [
            vis_mfs[-1],
            vis_samples[-1]
        ],
        updates=updates,
        name='sample_fn'
    )
    # create a space to store the image for plotting (we need to leave
    # room for the tile_spacing as well)
    image_data = numpy.zeros(
        (29 * n_samples + 1, 29 * n_chains - 1),
        dtype='uint8'
    )
    for idx in xrange(n_samples):
        # generate plot_every intermediate samples that we discard,
        # because successive samples in the chain are too correlated
        vis_mf, vis_sample = sample_fn()
        print '... plotting sample', idx
        image_data[29 * idx:29 * idx + 28, :] = tile_raster_images(
            X=vis_mf,
            img_shape=(28, 28),
            tile_shape=(1, n_chains),
            tile_spacing=(1, 1)
        )

    # construct image
    image = Image.fromarray(image_data)
    image.save('samples.png')
9.5 Results
We ran the code with PCD-15, a learning rate of 0.1 and a batch size of 20, for 15 epochs. Training the model
takes 122.466 minutes on an Intel Xeon E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.
The output was the following:
... loading data
Training epoch 0, cost is -90.6507246003
Training epoch 1, cost is -81.235857373
Training epoch 2, cost is -74.9120966945
Training epoch 3, cost is -73.0213216101
Training epoch 4, cost is -68.4098570497
Training epoch 5, cost is -63.2693021647
Training epoch 6, cost is -65.99578971
Training epoch 7, cost is -68.1236650015
Training epoch 8, cost is -68.3207365087
Training epoch 9, cost is -64.2949797113
Training epoch 10, cost is -61.5194867893
Training epoch 11, cost is -61.6539369402
Training epoch 12, cost is -63.5465278086
Training epoch 13, cost is -63.3787093527
Training epoch 14, cost is -62.755739271
Training took 122.466000 minutes
... plotting sample 0
... plotting sample 1
... plotting sample 2
... plotting sample 3
... plotting sample 4
... plotting sample 5
... plotting sample 6
... plotting sample 7
... plotting sample 8
... plotting sample 9
CHAPTER
TEN
Note: This section assumes the reader has already read through Classifying MNIST digits using Logistic
Regression and Multilayer Perceptron and Restricted Boltzmann Machines (RBM). Additionally it uses the
following Theano functions and concepts : T.tanh, shared variables, basic arithmetic ops, T.grad, Random
numbers, floatX. If you intend to run the code on GPU also read GPU.
Note: The code for this section is available for download here.
where $x = h^0$, $P(h^{k-1} | h^k)$ is a conditional distribution for the visible units conditioned on the hidden units
of the RBM at level k, and $P(h^{\ell-1}, h^{\ell})$ is the visible-hidden joint distribution in the top-level RBM. This is
illustrated in the figure below.
The principle of greedy layer-wise unsupervised training can be applied to DBNs with RBMs as the building
blocks for each layer [Hinton06], [Bengio07]. The process is as follows:
1. Train the first layer as an RBM that models the raw input $x = h^{(0)}$ as its visible layer.
2. Use that first layer to obtain a representation of the input that will be used as data for the second layer. Two
common solutions exist. This representation can be chosen as being the mean activations $p(h^{(1)} = 1|h^{(0)})$
or samples of $p(h^{(1)}|h^{(0)})$.
3. Train the second layer as an RBM, taking the transformed data (samples or mean activations) as training
examples (for the visible layer of that RBM).
4. Iterate (2 and 3) for the desired number of layers, each time propagating upward either samples or mean
values.
5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood,
or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned
representation into supervised predictions, e.g. a linear classifier).
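Steps 1 to 4 above can be sketched in NumPy as a simple loop. This is only an illustration under assumptions: train_rbm_cd1 is a hypothetical helper that runs a few full-batch CD-1 updates (not the tutorial's RBM class), the data is random binary noise, and the layer sizes are arbitrary. Mean activations are propagated upward, and fine-tuning (step 5) is omitted.

```python
import numpy as np


def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm_cd1(data, n_hidden, rng, epochs=5, lr=0.1):
    """Hypothetical helper: train one binary RBM with full-batch CD-1."""
    n_visible = data.shape[1]
    n = float(data.shape[0])
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    vbias = np.zeros(n_visible)
    hbias = np.zeros(n_hidden)
    for _ in range(epochs):
        # positive phase
        ph_mean = sigm(data.dot(W) + hbias)
        ph_sample = (rng.uniform(size=ph_mean.shape) < ph_mean).astype(float)
        # negative phase: one Gibbs step h -> v -> h
        nv_mean = sigm(ph_sample.dot(W.T) + vbias)
        nv_sample = (rng.uniform(size=nv_mean.shape) < nv_mean).astype(float)
        nh_mean = sigm(nv_sample.dot(W) + hbias)
        # CD-1 parameter updates
        W += lr * (data.T.dot(ph_mean) - nv_sample.T.dot(nh_mean)) / n
        vbias += lr * (data - nv_sample).mean(axis=0)
        hbias += lr * (ph_mean - nh_mean).mean(axis=0)
    return W, hbias, vbias


rng = np.random.RandomState(0)
x = (rng.uniform(size=(20, 12)) < 0.5).astype(float)  # toy binary data
layer_sizes = [8, 6, 4]

layers = []
layer_input = x
for n_hidden in layer_sizes:
    # steps 1 and 3: train this layer as an RBM on the current representation
    W, hbias, vbias = train_rbm_cd1(layer_input, n_hidden, rng)
    layers.append((W, hbias))
    # step 2: propagate the mean activations upward as data for the next layer
    layer_input = sigm(layer_input.dot(W) + hbias)
```

Only W and hbias are kept for each layer, since those are the parameters that a deep MLP initialized from the RBMs would reuse.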
In this tutorial, we focus on fine-tuning via supervised gradient descent. Specifically, we use a logistic
regression classifier to classify the input x based on the output of the last hidden layer $h^{(l)}$ of the DBN.
Fine-tuning is then performed via supervised gradient descent of the negative log-likelihood cost function.
Since the supervised gradient is only non-null for the weights and hidden layer biases of each layer (i.e.
null for the visible biases of each RBM), this procedure is equivalent to initializing the parameters of a deep
MLP with the weights and hidden layer biases obtained with the unsupervised training strategy.
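$$\log p(x) = KL(Q(h^{(1)}|x) || p(h^{(1)}|x)) + H_{Q(h^{(1)}|x)} + \sum_h Q(h^{(1)}|x) \left( \log p(h^{(1)}) + \log p(x|h^{(1)}) \right) \tag{10.2}$$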
$KL(Q(h^{(1)}|x) || p(h^{(1)}|x))$ represents the KL divergence between the posterior $Q(h^{(1)}|x)$ of the first RBM
if it were standalone, and the probability $p(h^{(1)}|x)$ for the same layer but defined by the entire DBN (i.e.
taking into account the prior $p(h^{(1)}, h^{(2)})$ defined by the top-level RBM). $H_{Q(h^{(1)}|x)}$ is the entropy of the
distribution $Q(h^{(1)}|x)$.
It can be shown that if we initialize both hidden layers such that $W^{(2)} = {W^{(1)}}^T$, then $Q(h^{(1)}|x) = p(h^{(1)}|x)$
and the KL divergence term is null. If we learn the first level RBM and then keep its parameters $W^{(1)}$ fixed,
optimizing Eq. (10.2) with respect to $W^{(2)}$ can thus only increase the likelihood p(x).
Also, notice that if we isolate the terms which depend only on $W^{(2)}$, we get:
$$\sum_h Q(h^{(1)}|x) \log p(h^{(1)})$$
Optimizing this with respect to $W^{(2)}$ amounts to training a second-stage RBM, using the output of $Q(h^{(1)}|x)$
as the training distribution, when x is sampled from the training distribution for the first RBM.
10.3 Implementation
To implement DBNs in Theano, we will use the class defined in the Restricted Boltzmann Machines (RBM)
tutorial. One can also observe that the code for the DBN is very similar to the code for the SdA, because both
involve the principle of unsupervised layer-wise pre-training followed by supervised fine-tuning as a deep
MLP. The main difference is that we use the RBM class instead of the dA class.
We start off by defining the DBN class which will store the layers of the MLP, along with their associated
RBMs. Since we take the viewpoint of using the RBMs to initialize an MLP, the code will reflect this by
separating as much as possible the RBMs used to initialize the network and the MLP used for classification.
class DBN(object):
    """Deep Belief Network

    A deep belief network is obtained by stacking several RBMs on top of each
    other. The hidden layer of the RBM at layer i becomes the input of the
    RBM at layer i+1. The first layer RBM gets as input the input of the
    network, and the hidden layer of the last RBM represents the output. When
    used for classification, the DBN is treated as a MLP, by adding a logistic
    regression layer on top.
    """

    def __init__(self, numpy_rng, theano_rng=None, n_ins=784,
                 hidden_layers_sizes=[500, 500], n_outs=10):
        """This class is made to support a variable number of layers.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: numpy random number generator used to draw initial
                          weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from rng

        :type n_ins: int
        :param n_ins: dimension of the input to the DBN

        :type hidden_layers_sizes: list of ints
        :param hidden_layers_sizes: intermediate layers size, must contain
                                    at least one value

        :type n_outs: int
self.sigmoid_layers will store the feed-forward graphs which together form the MLP, while
self.rbm_layers will store the RBMs used to pretrain each layer of the MLP.
Next, we construct n_layers sigmoid layers (we use the HiddenLayer class introduced in Multilayer
Perceptron, with the only modification that we replaced the non-linearity from tanh to the logistic
function $s(x) = \frac{1}{1 + e^{-x}}$) and n_layers RBMs, where n_layers is the depth of our model. We link the
sigmoid layers such that they form an MLP, and construct each RBM such that it shares the weight matrix
and the hidden bias with its corresponding sigmoid layer.
        for i in xrange(self.n_layers):
            # construct the sigmoidal layer

            # the size of the input is either the number of hidden
            # units of the layer below or the input size if we are on
            # the first layer
            if i == 0:
                input_size = n_ins
            else:
                input_size = hidden_layers_sizes[i - 1]

            # the input to this layer is either the activation of the
            # hidden layer below or the input of the DBN if you are on
            # the first layer
            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[-1].output

            sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                        input=layer_input,
                                        n_in=input_size,
                                        n_out=hidden_layers_sizes[i],
                                        activation=T.nnet.sigmoid)

            # add the layer to our list of layers
            self.sigmoid_layers.append(sigmoid_layer)

            # it's arguably a philosophical question... but we are
            # going to only declare that the parameters of the
            # sigmoid_layers are parameters of the DBN. The visible
            # biases in the RBM are parameters of those RBMs, but not
            # of the DBN.
            self.params.extend(sigmoid_layer.params)
All that is left is to stack one last logistic regression layer in order to form an MLP. We will use the
LogisticRegression class introduced in Classifying MNIST digits using Logistic Regression.
        self.logLayer = LogisticRegression(
            input=self.sigmoid_layers[-1].output,
            n_in=hidden_layers_sizes[-1],
            n_out=n_outs)
        self.params.extend(self.logLayer.params)

        # compute the cost for second phase of training, defined as the
        # negative log likelihood of the logistic regression (output) layer
        self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)

        # compute the gradients with respect to the model parameters
        # symbolic variable that points to the number of errors made on the
        # minibatch given by self.x and self.y
        self.errors = self.logLayer.errors(self.y)
The class also provides a method which generates training functions for each of the RBMs. They are returned
as a list, where element i is a function which implements one step of training for the RBM at layer i.
    def pretraining_functions(self, train_set_x, batch_size, k):
        '''Generates a list of functions, for performing one step of
        gradient descent at a given layer. The function will require
        as input the minibatch index, and to train an RBM you just
        need to iterate, calling the corresponding function on all
        minibatch indexes.

        :type train_set_x: theano.tensor.TensorType
        :param train_set_x: Shared var. that contains all datapoints used
                            for training the RBM
        :type batch_size: int
        :param batch_size: size of a [mini]batch
        :param k: number of Gibbs steps to do in CD-k / PCD-k

        '''

        # index to a [mini]batch
        index = T.lscalar('index')  # index to a minibatch
In order to be able to change the learning rate during training, we associate a Theano variable with it that has
a default value.
        learning_rate = T.scalar('lr')

        # number of batches
        n_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size
        # beginning of a batch, given index
        batch_begin = index * batch_size
        # ending of a batch given index
        batch_end = batch_begin + batch_size

        pretrain_fns = []
        for rbm in self.rbm_layers:

            # get the cost and the updates list
            # using CD-k here (persistent=None) for training each RBM.
            # TODO: change cost function to reconstruction error
            cost, updates = rbm.get_cost_updates(learning_rate,
                                                 persistent=None, k=k)

            # compile the theano function
            fn = theano.function(
                inputs=[index, theano.Param(learning_rate, default=0.1)],
                outputs=cost,
                updates=updates,
                givens={
                    self.x: train_set_x[batch_begin:batch_end]
                }
            )
            # append fn to the list of functions
            pretrain_fns.append(fn)

        return pretrain_fns
Each function pretrain_fns[i] now takes index as an argument and, optionally, lr (the learning rate).
Note that the argument names are the names given to the Theano variables when they are constructed
(e.g. lr), not the names of the Python variables (e.g. learning_rate). Keep this in mind when
working with Theano. Optionally, if you provide k (the number of Gibbs steps to perform in CD or PCD),
this will also become an argument of your function.
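The batch arithmetic behind the givens clause can be checked outside Theano. The following is a minimal plain-Python sketch, not part of the tutorial code: minibatch_slices is a hypothetical helper, and Python 3 is assumed (hence the explicit // integer division).

```python
def minibatch_slices(n_examples, batch_size):
    """Yield (index, batch_begin, batch_end) for every full minibatch."""
    # Integer division: a trailing partial batch is dropped.
    n_batches = n_examples // batch_size
    for index in range(n_batches):
        batch_begin = index * batch_size
        batch_end = batch_begin + batch_size
        yield index, batch_begin, batch_end


# Toy stand-in for train_set_x: 25 "examples".
data = list(range(25))
batches = [data[begin:end]
           for _, begin, end in minibatch_slices(len(data), batch_size=10)]
print(len(batches))   # number of full minibatches
print(batches[1])     # the second minibatch
```

With 25 examples and a batch size of 10, the last 5 examples never appear in a minibatch, which mirrors what the integer division in n_batches implies.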
In the same fashion, the DBN class includes a method for building the functions required for finetuning (a
train_model, a validate_model and a test_model function).
def build_finetune_functions(self, datasets, batch_size, learning_rate):
Generates a function train that implements one step of
finetuning, a function validate that computes the error on a
batch from the validation set, and a function test that
# index to a [mini]batch
self.x: test_set_x[
index * batch_size: (index + 1) * batch_size
],
self.y: test_set_y[
index * batch_size: (index + 1) * batch_size
]
}
)
valid_score_i = theano.function(
[index],
self.errors,
givens={
self.x: valid_set_x[
index * batch_size: (index + 1) * batch_size
],
self.y: valid_set_y[
index * batch_size: (index + 1) * batch_size
]
}
)
# Create a function that scans the entire validation set
def valid_score():
return [valid_score_i(i) for i in xrange(n_valid_batches)]
# Create a function that scans the entire test set
def test_score():
return [test_score_i(i) for i in xrange(n_test_batches)]
return train_fn, valid_score, test_score
Note that the returned valid_score and test_score are not Theano functions, but rather Python
functions. These loop over the entire validation set and the entire test set to produce a list of the losses
obtained over these sets.
There are two stages in training this network: (1) a layer-wise pre-training and (2) a fine-tuning stage.
For the pre-training stage, we loop over all the layers of the network. For each layer, we use the compiled
Theano function which determines the input to the i-th level RBM and performs one step of CD-k within
this RBM. This function is applied to the training set for a fixed number of epochs given by
pretraining_epochs.
#########################
# PRETRAINING THE MODEL #
#########################
print '... getting the pretraining functions'
pretraining_fns = dbn.pretraining_functions(train_set_x=train_set_x,
                                            batch_size=batch_size,
                                            k=k)

print '... pre-training the model'
start_time = timeit.default_timer()
## Pre-train layer-wise
for i in xrange(dbn.n_layers):
    # go through pretraining epochs
    for epoch in xrange(pretraining_epochs):
        # go through the training set
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(pretraining_fns[i](index=batch_index,
                                        lr=pretrain_lr))
        print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch),
        print numpy.mean(c)

end_time = timeit.default_timer()
The fine-tuning loop is very similar to the one in the Multilayer Perceptron tutorial, the only difference being
that we now use the functions given by build_finetune_functions.
With the default parameters, the code runs for 100 pre-training epochs with mini-batches of size 10. This
corresponds to performing 500,000 unsupervised parameter updates. We use an unsupervised learning rate
of 0.01, with a supervised learning rate of 0.1. The DBN itself consists of three hidden layers with 1000
units per layer. With early-stopping, this configuration achieved a minimal validation error of 1.27 with
corresponding test error of 1.34 after 46 supervised epochs.
On an Intel(R) Xeon(R) CPU X5560 running at 2.80GHz, using a multi-threaded MKL library (running on
4 cores), pretraining took 615 minutes with an average of 2.05 mins/(layer * epoch). Fine-tuning took only
101 minutes or approximately 2.20 mins/epoch.
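The figures quoted above can be reproduced with a few lines of arithmetic. This is only a sanity-check sketch: the training-set size of 50,000 examples is an assumption (it is not stated in this section), and the update count is taken per layer.

```python
n_train = 50000            # assumption: number of training examples
batch_size = 10
pretraining_epochs = 100
n_layers = 3

# Unsupervised updates performed for each layer.
updates_per_layer = (n_train // batch_size) * pretraining_epochs
print(updates_per_layer)

# Average pre-training time per (layer * epoch), in minutes.
pretrain_minutes = 615.0
minutes_per_layer_epoch = pretrain_minutes / (n_layers * pretraining_epochs)
print(minutes_per_layer_epoch)

# Average fine-tuning time per supervised epoch, in minutes.
finetune_minutes, finetune_epochs = 101.0, 46
minutes_per_epoch = finetune_minutes / finetune_epochs
print(round(minutes_per_epoch, 2))
```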
Hyper-parameters were selected by optimizing on the validation error. We tested unsupervised learning
rates in $\{10^{-1}, ..., 10^{-5}\}$ and supervised learning rates in $\{10^{-1}, ..., 10^{-4}\}$. We did not
use any form of regularization besides early-stopping, nor did we optimize over the number of pretraining updates.
CHAPTER
ELEVEN
Note: This is an advanced tutorial, which shows how one can implement Hybrid Monte-Carlo (HMC)
sampling using Theano. We assume the reader is already familiar with Theano and energy-based models
such as the RBM.
Note: The code for this section is available for download here.
11.1 Theory
Maximum likelihood learning of energy-based models requires a robust algorithm to sample negative phase
particles (see Eq.(4) of the Restricted Boltzmann Machines (RBM) tutorial). When training RBMs with CD
or PCD, this is typically done with block Gibbs sampling, where the conditional distributions p(h|v) and
p(v|h) are used as the transition operators of the Markov chain.
In certain cases however, these conditional distributions might be difficult to sample from (e.g. requiring
expensive matrix inversions, as in the case of the mean-covariance RBM). Also, even if Gibbs sampling
can be done efficiently, it nevertheless operates via a random walk which might not be statistically efficient
for some distributions. In this context, and when sampling from continuous variables, Hybrid Monte Carlo
(HMC) can prove to be a powerful tool [Duane87]. It avoids random walk behavior by simulating a physical
system governed by Hamiltonian dynamics, potentially avoiding tricky conditional distributions in the
process.
In HMC, model samples are obtained by simulating a physical system, where particles move about a
high-dimensional landscape, subject to potential and kinetic energies. Adapting the notation from [Neal93],
particles are characterized by a position vector or state $s \in \mathbb{R}^D$ and velocity vector
$\phi \in \mathbb{R}^D$. The combined state of a particle is denoted as $\chi = (s, \phi)$. The Hamiltonian
is then defined as the sum of potential energy $E(s)$ (same energy function defined by energy-based models)
and kinetic energy $K(\phi)$, as follows:

$$H(s, \phi) = E(s) + K(\phi) = E(s) + \frac{1}{2} \sum_i \phi_i^2$$

Instead of sampling $p(s)$ directly, HMC operates by sampling from the canonical distribution
$p(s, \phi) = \frac{1}{Z} \exp(-H(s, \phi)) = p(s) p(\phi)$. Because the two variables are independent,
marginalizing over $\phi$ is trivial and recovers the original distribution of interest.
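To spell out why the joint distribution factorizes (writing the velocity as $\phi$ and assuming the unit-mass quadratic kinetic energy above), note that the exponential of a sum splits into a product. The normalizers $Z_s$ and $Z_\phi$ are labels introduced only for this sketch:

```latex
\begin{aligned}
p(s, \phi) &= \frac{1}{Z} \exp\Big(-E(s) - \tfrac{1}{2} \sum_i \phi_i^2\Big) \\
           &= \underbrace{\frac{1}{Z_s} \exp(-E(s))}_{p(s)} \;
              \underbrace{\frac{1}{Z_\phi} \exp\Big(-\tfrac{1}{2} \sum_i \phi_i^2\Big)}_{p(\phi)},
              \qquad Z = Z_s Z_\phi, \\
\int p(s, \phi) \, d\phi &= p(s) \int p(\phi) \, d\phi = p(s).
\end{aligned}
```

Under this assumption $p(\phi)$ is a standard Gaussian with identity covariance, so $Z_\phi = (2\pi)^{D/2}$.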
Hamiltonian Dynamics
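State $s$ and velocity $\phi$ are modified such that $H(s, \phi)$ remains constant throughout the simulation. The
differential equations are given by:

$$\frac{ds_i}{dt} = \frac{\partial H}{\partial \phi_i} = \phi_i$$
$$\frac{d\phi_i}{dt} = -\frac{\partial H}{\partial s_i} = -\frac{\partial E}{\partial s_i} \qquad (11.1)$$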
As shown in [Neal93], the above transformation preserves volume and is reversible. The above dynamics
can thus be used as transition operators of a Markov chain and will leave $p(s, \phi)$ invariant. That chain by
itself is not ergodic however, since simulating the dynamics maintains a fixed Hamiltonian $H(s, \phi)$. HMC
thus alternates Hamiltonian dynamic steps with Gibbs sampling of the velocity. Because $p(s)$ and $p(\phi)$ are
independent, sampling a new $\phi \sim p(\phi|s)$ is trivial since $p(\phi|s) = p(\phi)$, where $p(\phi)$ is often
taken to be the univariate Gaussian.
The Leap-Frog Algorithm
In practice, we cannot simulate Hamiltonian dynamics exactly because of the problem of time discretization.
There are several ways one can do this. To maintain invariance of the Markov chain however, care must be
taken to preserve the properties of volume conservation and time reversibility. The leap-frog algorithm
maintains these properties and operates in 3 steps:
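$$\phi_i(t + \epsilon/2) = \phi_i(t) - \frac{\epsilon}{2} \frac{\partial E(s(t))}{\partial s_i}$$
$$s_i(t + \epsilon) = s_i(t) + \epsilon \phi_i(t + \epsilon/2)$$
$$\phi_i(t + \epsilon) = \phi_i(t + \epsilon/2) - \frac{\epsilon}{2} \frac{\partial E(s(t + \epsilon))}{\partial s_i} \qquad (11.2)$$

We thus perform a half-step update of the velocity at time $t + \epsilon/2$, which is then used to compute
$s(t + \epsilon)$ and $\phi(t + \epsilon)$.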
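The three leap-frog updates of Eq. (11.2) can be tried on a toy problem. The sketch below is plain Python rather than the tutorial's Theano code; it assumes a one-dimensional quadratic energy E(s) = s^2 / 2 (chosen only for illustration), writes the velocity as phi, and checks the two properties claimed above: the Hamiltonian is nearly conserved, and the map is reversible.

```python
def leapfrog_step(s, phi, eps, grad_energy):
    """One leap-frog step: half-step velocity, full-step position, half-step velocity."""
    phi_half = phi - 0.5 * eps * grad_energy(s)
    s_new = s + eps * phi_half
    phi_new = phi_half - 0.5 * eps * grad_energy(s_new)
    return s_new, phi_new


def energy(s):
    return 0.5 * s * s          # illustrative potential energy


def grad_energy(s):
    return s                    # dE/ds for the energy above


def hamiltonian(s, phi):
    return energy(s) + 0.5 * phi * phi


s0, phi0, eps, n = 1.0, 0.5, 0.1, 100

s, phi = s0, phi0
for _ in range(n):
    s, phi = leapfrog_step(s, phi, eps, grad_energy)
energy_drift = abs(hamiltonian(s, phi) - hamiltonian(s0, phi0))

# Time reversibility: flip the velocity, integrate again, flip back.
s_back, phi_back = s, -phi
for _ in range(n):
    s_back, phi_back = leapfrog_step(s_back, phi_back, eps, grad_energy)
phi_back = -phi_back

print(energy_drift)         # small but non-zero discretization error
print(s_back, phi_back)     # approximately the starting point (1.0, 0.5)
```

The residual energy drift is what the accept/reject stage described next is meant to correct.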
Accept / Reject
In practice, using finite stepsizes will not preserve H(s, ) exactly and will introduce bias in the simulation.
Also, rounding errors due to the use of floating point numbers means that the above transformation will not
be perfectly reversible.
HMC cancels these effects exactly by adding a Metropolis accept/reject stage, after n leapfrog steps. The
new state $\chi' = (s', \phi')$ is accepted with probability $p_{acc}(\chi, \chi')$, defined as:

$$p_{acc}(\chi, \chi') = \min\left(1, \frac{\exp(-H(s', \phi'))}{\exp(-H(s, \phi))}\right)$$
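The acceptance rule can be written in a few lines of plain Python. This is a sketch with hypothetical helper names (accept_probability, metropolis_accept); it computes the ratio of exponentials as exp(H_old - H_new), which is algebraically the same quantity.

```python
import math
import random


def accept_probability(h_old, h_new):
    """min(1, exp(-h_new) / exp(-h_old)), computed as exp(h_old - h_new)."""
    if h_new <= h_old:
        return 1.0      # ratio >= 1; also avoids overflow for large energy drops
    return math.exp(h_old - h_new)


def metropolis_accept(h_old, h_new, rng):
    """Return True with probability accept_probability(h_old, h_new)."""
    return rng.random() < accept_probability(h_old, h_new)


rng = random.Random(0)
print(accept_probability(1.0, 0.5))     # Hamiltonian decreased: always accept
print(accept_probability(0.0, 1.0))     # exp(-1)
n_accepted = sum(metropolis_accept(0.0, 1.0, rng) for _ in range(10000))
print(n_accepted / 10000.0)             # empirical rate, close to exp(-1)
```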
HMC Algorithm
In this tutorial, we obtain a new HMC sample as follows:
1. sample a new velocity from a univariate Gaussian distribution
2. perform n leapfrog steps to obtain the new state $\chi'$
3. perform accept/reject move of $\chi'$
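The three steps above can be sketched end to end without Theano. The example below is a plain-Python illustration, not the tutorial's implementation: the target (a one-dimensional Gaussian with mean 3 and standard deviation 2), the step size and the number of leap-frog steps are all arbitrary choices made for the sketch.

```python
import math
import random

MU, SIGMA = 3.0, 2.0    # hypothetical target: N(3, 2^2)


def energy(s):
    return 0.5 * ((s - MU) / SIGMA) ** 2


def grad_energy(s):
    return (s - MU) / SIGMA ** 2


def hmc_draw(s, eps, n_steps, rng):
    """One HMC transition; returns (new_state, accepted)."""
    # 1. sample a new velocity from a standard Gaussian
    phi = rng.gauss(0.0, 1.0)
    h_old = energy(s) + 0.5 * phi * phi

    # 2. n_steps leap-frog steps: initial half-step for the velocity,
    #    alternating full steps, final half-step for the velocity
    s_new = s
    phi_new = phi - 0.5 * eps * grad_energy(s_new)
    for step in range(n_steps):
        s_new = s_new + eps * phi_new
        if step < n_steps - 1:
            phi_new = phi_new - eps * grad_energy(s_new)
    phi_new = phi_new - 0.5 * eps * grad_energy(s_new)
    h_new = energy(s_new) + 0.5 * phi_new * phi_new

    # 3. Metropolis accept/reject
    if h_new <= h_old or rng.random() < math.exp(h_old - h_new):
        return s_new, True
    return s, False


rng = random.Random(0)
s, samples, n_accepted = 0.0, [], 0
for i in range(5500):
    s, accepted = hmc_draw(s, eps=0.3, n_steps=10, rng=rng)
    if i >= 500:                # discard burn-in
        samples.append(s)
        n_accepted += accepted

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var, n_accepted / float(len(samples)))
```

The printed mean and variance should land near 3 and 4, which is the same kind of check the tutorial later performs on a multivariate Gaussian.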
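$$\phi_i(t + \epsilon/2) = \phi_i(t) - \frac{\epsilon}{2} \frac{\partial E(s(t))}{\partial s_i}$$
$$s_i(t + \epsilon) = s_i(t) + \epsilon \phi_i(t + \epsilon/2)$$
For $m \in [2, n]$, perform full updates:
$$\phi_i(t + (m - 1/2)\epsilon) = \phi_i(t + (m - 3/2)\epsilon) - \epsilon \frac{\partial E(s(t + (m - 1)\epsilon))}{\partial s_i}$$
$$s_i(t + m\epsilon) = s_i(t + (m - 1)\epsilon) + \epsilon \phi_i(t + (m - 1/2)\epsilon)$$
$$\phi_i(t + n\epsilon) = \phi_i(t + (n - 1/2)\epsilon) - \frac{\epsilon}{2} \frac{\partial E(s(t + n\epsilon))}{\partial s_i} \qquad (11.3)$$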
The simulate_dynamics function implements the full algorithm of Eq. (11.3). We start with the initial
half-step update of $\phi$ and full-step update of $s$, and then scan over the leapfrog method n_steps - 1 times.
def simulate_dynamics(initial_pos, initial_vel, stepsize, n_steps, energy_fn):
    """
    Return final (position, velocity) obtained after an n_steps leapfrog
    updates, using Hamiltonian dynamics.

    Parameters
    ----------
    initial_pos: shared theano matrix
        Initial position at which to start the simulation
    initial_vel: shared theano matrix
        Initial velocity of particles
    stepsize: shared theano scalar
        Scalar value controlling amount by which to move
    energy_fn: python function
        Python function, operating on symbolic theano variables, used to
        compute the potential energy at a given position.

    Returns
    -------
    rval1: theano matrix
        Final positions obtained after simulation
    rval2: theano matrix
        Final velocity obtained after simulation
    """

    def leapfrog(pos, vel, step):
        """
        Inside loop of Scan. Performs one step of leapfrog update, using
        Hamiltonian dynamics.

        Parameters
        ----------
        pos: theano matrix
            in leapfrog update equations, represents pos(t), position at time t
        vel: theano matrix
            in leapfrog update equations, represents vel(t - stepsize/2),
            velocity at time (t - stepsize/2)
        step: theano scalar
            scalar value controlling amount by which to move

        Returns
        -------
        rval1: [theano matrix, theano matrix]
            Symbolic theano matrices for new position pos(t + stepsize), and
            velocity vel(t + stepsize/2)
        rval2: dictionary
            Dictionary of updates for the Scan Op
        """
        # from pos(t) and vel(t-stepsize/2), compute vel(t+stepsize/2)
        dE_dpos = TT.grad(energy_fn(pos).sum(), pos)
        new_vel = vel - step * dE_dpos
        # from vel(t+stepsize/2) compute pos(t+stepsize)
        new_pos = pos + step * new_vel
        return [new_pos, new_vel], {}

    # compute velocity at time-step: t + stepsize/2
    initial_energy = energy_fn(initial_pos)
    dE_dpos = TT.grad(initial_energy.sum(), initial_pos)
    vel_half_step = initial_vel - 0.5 * stepsize * dE_dpos

    # compute position at time-step: t + stepsize
    pos_full_step = initial_pos + stepsize * vel_half_step

    # perform leapfrog updates: the scan op is used to repeatedly compute
    # vel(t + (m-1/2)*stepsize) and pos(t + m*stepsize) for m in [2,n_steps].
    (all_pos, all_vel), scan_updates = theano.scan(
        leapfrog,
        outputs_info=[
            dict(initial=pos_full_step),
            dict(initial=vel_half_step),
        ],
        non_sequences=[stepsize],
        n_steps=n_steps - 1)
    final_pos = all_pos[-1]
    final_vel = all_vel[-1]
    # NOTE: Scan always returns an updates dictionary, in case the
    # scanned function draws samples from a RandomStream. These
    # updates must then be used when compiling the Theano function, to
    # avoid drawing the same random numbers each time the function is
    # called. In this case however, we consciously ignore
    # "scan_updates" because we know it is empty.
    assert not scan_updates
A final half-step is performed to compute $\phi(t + n\epsilon)$, and the final proposed state $\chi'$ is returned.
hmc_move
The hmc_move function implements the remaining steps (steps 1 and 3) of an HMC move proposal (while
wrapping the simulate_dynamics function). Given a matrix of initial states $s \in \mathbb{R}^{N \times D}$
(positions) and energy function $E(s)$ (energy_fn), it defines the symbolic graph for computing n_steps of
HMC, using a given stepsize. The function prototype is as follows:
def hmc_move(s_rng, positions, energy_fn, stepsize, n_steps):
    """
    This function performs one-step of Hybrid Monte-Carlo sampling. We start by
    sampling a random velocity from a univariate Gaussian distribution, perform
    n_steps leap-frog updates using Hamiltonian dynamics and accept-reject
    using Metropolis-Hastings.

    Parameters
    ----------
    s_rng: theano shared random stream
        Symbolic random number generator used to draw random velocity and
        perform accept-reject move.
    positions: shared theano matrix
        Symbolic matrix whose rows are position vectors.
    energy_fn: python function
        Python function, operating on symbolic theano variables, used to
        compute the potential energy at a given position.
    stepsize: shared theano scalar
        Shared variable containing the stepsize to use for n_steps of HMC
        simulation steps.
    n_steps: integer
        Number of HMC steps to perform before proposing a new position.

    Returns
    -------
    rval1: boolean
        True if move is accepted, False otherwise
    rval2: theano matrix
        Matrix whose rows contain the proposed "new position"
    """
We start by sampling random velocities, using the provided shared RandomStream object. Velocities are
sampled independently for each dimension and for each particle under simulation, yielding an $N \times D$ matrix.

Since we now have an initial position and velocity, we can call simulate_dynamics to obtain the
proposal for the new state $\chi'$.
# perform simulation of particles subject to Hamiltonian dynamics
final_pos, final_vel = simulate_dynamics(
initial_pos=positions,
initial_vel=initial_vel,
stepsize=stepsize,
n_steps=n_steps,
energy_fn=energy_fn
)
    Parameters
    ----------
    pos: theano matrix
        Symbolic matrix whose rows are position vectors.
    vel: theano matrix
        Symbolic matrix whose rows are velocity vectors.
    energy_fn: python function
        Python function, operating on symbolic theano variables, used to
        compute the potential energy at a given position.

    Returns
    -------
    return: theano vector
        Vector whose i-th entry is the Hamiltonian at position pos[i] and
        velocity vel[i].
    """
    # assuming mass is 1
    return energy_fn(pos) + kinetic_energy(vel)


def kinetic_energy(vel):
    """Returns the kinetic energy associated with the given velocity
    and mass of 1.

    Parameters
    ----------
    vel: theano matrix
        Symbolic matrix whose rows are velocity vectors.

    Returns
    -------
    return: theano vector
        Vector whose i-th entry is the kinetic energy associated with vel[i].
    """
    return 0.5 * (vel ** 2).sum(axis=1)
hmc_move finally returns the tuple (accept, final_pos). accept is a symbolic boolean variable indicating
whether or not the new state final_pos should be used.
hmc_updates
The purpose of hmc_updates is to generate the list of updates to perform, whenever our HMC sampling function is called. hmc_updates thus receives as parameters, a series of shared variables to update
(positions, stepsize and avg_acceptance_rate), and the parameters required to compute their new state.
def hmc_updates(positions, stepsize, avg_acceptance_rate, final_pos, accept,
                target_acceptance_rate, stepsize_inc, stepsize_dec,
                stepsize_min, stepsize_max, avg_acceptance_slowness):
    """This function is executed after n_steps of HMC sampling
    (hmc_move function). It creates the updates dictionary used by
    the simulate function. It takes care of updating: the position
    (if the move is accepted), the stepsize (to track a given target
    acceptance rate) and the average acceptance rate (computed as a
    moving average).

    Parameters
    ----------
    positions: shared variable, theano matrix
        Shared theano matrix whose rows contain the old position
    stepsize: shared variable, theano scalar
        Shared theano scalar containing current step size
    avg_acceptance_rate: shared variable, theano scalar
        Shared theano scalar containing the current average acceptance rate
    final_pos: shared variable, theano matrix
        Shared theano matrix whose rows contain the new position
    accept: theano scalar
        Boolean-type variable representing whether or not the proposed HMC move
        should be accepted.
    target_acceptance_rate: float
        The stepsize is modified in order to track this target acceptance rate.
    stepsize_inc: float
        Amount by which to increment stepsize when acceptance rate is too high.
    stepsize_dec: float
        Amount by which to decrement stepsize when acceptance rate is too low.
    stepsize_min: float
        Lower-bound on stepsize.
    stepsize_max: float
        Upper-bound on stepsize.
    avg_acceptance_slowness: float
        Average acceptance rate is computed as an exponential moving average.
        (1-avg_acceptance_slowness) is the weight given to the newest
        observation.

    Returns
    -------
    rval1: dictionary-like
        A dictionary of updates to be used by the HMC_Sampler.simulate
        function. The updates target the position, stepsize and average
        acceptance rate.
    """

    ## POSITION UPDATES ##
    # broadcast accept scalar to tensor with the same dimensions as
    # final_pos.
    accept_matrix = accept.dimshuffle(0, *(('x',) * (final_pos.ndim - 1)))
    # if accept is True, update to final_pos else stay put
    new_positions = TT.switch(accept_matrix, final_pos, positions)
Using the above code, the dictionary {positions: new_positions} can be used to update the state of the
sampler with either (1) the new state final_pos if accept is True, or (2) the old state if accept is False. This
conditional assignment is performed by the switch op.

switch expects as its first argument a boolean mask with the same broadcastable dimensions as the second
and third arguments. Since accept holds one scalar decision per particle (one entry per row of final_pos), we
must first use dimshuffle to transform it to a tensor with final_pos.ndim broadcastable dimensions (accept_matrix).
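The same masking pattern can be illustrated with NumPy, where adding a trailing axis plays the role of dimshuffle and numpy.where plays the role of switch. This is an analogy only, using made-up toy arrays, and is not how the Theano graph is built:

```python
import numpy as np

positions = np.zeros((3, 2))             # old state: 3 particles, 2 dimensions
final_pos = np.ones((3, 2))              # proposed state
accept = np.array([True, False, True])   # one decision per particle

# Analogue of accept.dimshuffle(0, 'x'): append a broadcastable axis.
accept_matrix = accept[:, np.newaxis]    # shape (3, 1)

# Analogue of TT.switch: take final_pos where accepted, else keep positions.
new_positions = np.where(accept_matrix, final_pos, positions)
print(new_positions)
```

Only the rows whose move was accepted are replaced; the rejected particle keeps its old position.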
11.2. Implementing HMC Using Theano
hmc_updates additionally implements an adaptive version of HMC, as implemented in the accompanying
code to [Ranzato10]. We start by tracking the average acceptance rate of the HMC move
proposals (across many simulations), using an exponential moving average with time constant
1 - avg_acceptance_slowness.
## ACCEPT RATE UPDATES ##
# perform exponential moving average
mean_dtype = theano.scalar.upcast(accept.dtype, avg_acceptance_rate.dtype)
new_acceptance_rate = TT.add(
avg_acceptance_slowness * avg_acceptance_rate,
(1.0 - avg_acceptance_slowness) * accept.mean(dtype=mean_dtype))
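As a plain-Python illustration of the moving average computed above (a sketch with a hypothetical helper name; the slowness value of 0.9 is chosen only for the example):

```python
def update_acceptance_rate(avg_rate, accepted, slowness):
    """Exponential moving average of the acceptance rate.

    accepted is a list of booleans, one per particle in the batch.
    """
    batch_rate = sum(accepted) / float(len(accepted))
    return slowness * avg_rate + (1.0 - slowness) * batch_rate


# Previous average 0.9; 3 of 4 particles accepted in the current move.
rate = update_acceptance_rate(0.9, [True, True, False, True], slowness=0.9)
print(rate)   # 0.9 * 0.9 + 0.1 * 0.75
```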
If the average acceptance rate is larger than the target_acceptance_rate, we increase the stepsize by a
factor of stepsize_inc in order to increase the mixing rate of our chain. If the average acceptance rate is too
low however, stepsize is decreased by a factor of stepsize_dec, yielding a more conservative mixing rate.
The clip op allows us to maintain the stepsize in the range [stepsize_min, stepsize_max].
## STEPSIZE UPDATES ##
# if acceptance rate is too low, our sampler is too "noisy" and we reduce
# the stepsize. If it is too high, our sampler is too conservative, we can
# get away with a larger stepsize (resulting in better mixing).
_new_stepsize = TT.switch(avg_acceptance_rate > target_acceptance_rate,
stepsize * stepsize_inc, stepsize * stepsize_dec)
# maintain stepsize in [stepsize_min, stepsize_max]
new_stepsize = TT.clip(_new_stepsize, stepsize_min, stepsize_max)
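The stepsize rule can likewise be sketched in plain Python. The helper name and the numeric values below (increase factor 1.02, decrease factor 0.98, bounds 0.001 and 0.25) are illustrative assumptions, not values confirmed by this section:

```python
def update_stepsize(stepsize, avg_rate, target_rate, inc, dec, lo, hi):
    """Grow the stepsize when accepting too often, shrink it otherwise,
    then clip the result to the interval [lo, hi]."""
    if avg_rate > target_rate:
        proposed = stepsize * inc
    else:
        proposed = stepsize * dec
    return min(max(proposed, lo), hi)


args = dict(target_rate=0.9, inc=1.02, dec=0.98, lo=0.001, hi=0.25)
print(update_stepsize(0.1, avg_rate=0.95, **args))    # grows
print(update_stepsize(0.1, avg_rate=0.50, **args))    # shrinks
print(update_stepsize(0.25, avg_rate=0.95, **args))   # clipped at the upper bound
```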
HMC_sampler
We finally tie everything together using the HMC_sampler class. Its main elements are:
new_from_shared_positions: a constructor method which allocates various shared variables and
strings together the calls to hmc_move and hmc_updates. It also builds the Theano function
simulate, whose sole purpose is to execute the updates generated by hmc_updates.
draw: a convenience method which calls the Theano function simulate and returns a copy of the
contents of the shared variable self.positions.
class HMC_sampler(object):
"""
Convenience wrapper for performing Hybrid Monte Carlo (HMC). It creates the
symbolic graph for performing an HMC simulation (using hmc_move and
hmc_updates). The graph is then compiled into the simulate function, a
theano function which runs the simulation and updates the required shared
variables.
Users should interface with the sampler through the draw function which
advances the Markov chain and returns the current sample by calling
simulate and get_position in sequence.
shared_positions,
stepsize,
avg_acceptance_rate,
final_pos=final_pos,
accept=accept,
stepsize_min=stepsize_min,
stepsize_max=stepsize_max,
stepsize_inc=stepsize_inc,
stepsize_dec=stepsize_dec,
target_acceptance_rate=target_acceptance_rate,
avg_acceptance_slowness=avg_acceptance_slowness)
# compile theano function
simulate = function([], [], updates=simulate_updates)
# create HMC_sampler object with the following attributes ...
return cls(
positions=shared_positions,
stepsize=stepsize,
stepsize_min=stepsize_min,
stepsize_max=stepsize_max,
avg_acceptance_rate=avg_acceptance_rate,
target_acceptance_rate=target_acceptance_rate,
s_rng=s_rng,
_updates=simulate_updates,
simulate=simulate)
    def draw(self, **kwargs):
        """
        Returns a new position obtained after n_steps of HMC simulation.

        Parameters
        ----------
        kwargs: dictionary
            The kwargs dictionary is passed to the shared variable
            (self.positions) get_value() function. For example, to avoid
            copying the shared variable value, consider passing borrow=True.

        Returns
        -------
        rval: numpy matrix
            Numpy matrix whose dimensions are similar to initial_position.
        """
        self.simulate()
        # by default (borrow=False), get_value returns a copy
        return self.positions.get_value(**kwargs)
We initialize the sampler by allocating a position shared variable. It is passed to the constructor of
HMC_sampler along with our target energy function.
Following a burn-in period, we then generate a large number of samples and compare the empirical mean
and covariance matrix to their true values.
def sampler_on_nd_gaussian(sampler_cls, burnin, n_samples, dim=10):
    batchsize = 3

    rng = numpy.random.RandomState(123)

    # Define a covariance and mu for a gaussian
    mu = numpy.array(rng.rand(dim) * 10, dtype=theano.config.floatX)
    cov = numpy.array(rng.rand(dim, dim), dtype=theano.config.floatX)
    cov = (cov + cov.T) / 2.
    cov[numpy.arange(dim), numpy.arange(dim)] = 1.0
    cov_inv = linalg.inv(cov)

    # Define energy function for a multi-variate Gaussian
    def gaussian_energy(x):
        return 0.5 * (theano.tensor.dot((x - mu), cov_inv) *
                      (x - mu)).sum(axis=1)

    # Declared shared random variable for positions
    position = rng.randn(batchsize, dim).astype(theano.config.floatX)
    position = theano.shared(position)

    # Create HMC sampler
    sampler = sampler_cls(position, gaussian_energy,
                          initial_stepsize=1e-3, stepsize_max=0.5)

    # Start with a burn-in process
    garbage = [sampler.draw() for r in xrange(burnin)]  # burn-in Draw
    # n_samples: result is a 3D tensor of dim [n_samples, batchsize,
    # dim]
    _samples = numpy.asarray([sampler.draw() for r in xrange(n_samples)])
    # Flatten to [n_samples * batchsize, dim]
    samples = _samples.T.reshape(dim, -1).T

    print '****** TARGET VALUES ******'
    print 'target mean:', mu
    print 'target cov:\n', cov

    print '****** EMPIRICAL MEAN/COV USING HMC ******'
    print 'empirical mean: ', samples.mean(axis=0)
    print 'empirical_cov:\n', numpy.cov(samples.T)

    print '****** HMC INTERNALS ******'
    print 'final stepsize', sampler.stepsize.get_value()
    print 'final acceptance_rate', sampler.avg_acceptance_rate.get_value()

    return sampler
def test_hmc():
    sampler = sampler_on_nd_gaussian(HMC_sampler.new_from_shared_positions,
                                     burnin=1000, n_samples=1000, dim=5)
    assert abs(sampler.avg_acceptance_rate.get_value() -
               sampler.target_acceptance_rate) < .1
    assert sampler.stepsize.get_value() >= sampler.stepsize_min
    assert sampler.stepsize.get_value() <= sampler.stepsize_max
The above code can be run using the command: nosetests -s code/hmc/test_hmc.py. The output is as
follows:
[desjagui@atchoum hmc]$ python test_hmc.py
****** TARGET VALUES ******
target mean: [ 6.96469186  2.86139335  2.26851454  5.51314769  7.1946897 ]
target cov:
[[ 1.          0.66197111  0.71141257  0.55766643  0.35753822]
 [ 0.66197111  1.          0.31053199  0.45455485  0.37991646]
 [ 0.71141257  0.31053199  1.          0.62800335  0.38004541]
 [ 0.55766643  0.45455485  0.62800335  1.          0.50807871]
 [ 0.35753822  0.37991646  0.38004541  0.50807871  1.        ]]
7.19414496]
As can be seen above, the samples generated by our HMC sampler yield an empirical mean and covariance
matrix, which are very close to the true underlying parameters. The adaptive algorithm also seemed to work
well as the final acceptance rate is close to our target of 0.9.
11.4 References
CHAPTER
TWELVE
12.1 Summary
In this tutorial, you will learn how to:
learn Word Embeddings
using Recurrent Neural Networks architectures
with Context Windows
in order to perform Semantic Parsing / Slot-Filling (Spoken Language Understanding)
12.2.2 Papers
If you use this tutorial, cite the following papers:
Grégoire Mesnil, Xiaodong He, Li Deng and Yoshua Bengio. Investigation of Recurrent-Neural-Network
Architectures and Learning Methods for Spoken Language Understanding. Interspeech, 2013.
Gokhan Tur, Dilek Hakkani-Tur and Larry Heck. What is left to be understood in ATIS?
Christian Raymond and Giuseppe Riccardi. Generative and discriminative algorithms for spoken
language understanding. Interspeech, 2007.
Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron,
Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. NIPS
Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins,
Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and
GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference
(SciPy), June 2010.
Thank you!
12.2.3 Contact
Please email Grégoire Mesnil with any problem report or feedback. We will be glad to hear from you.
12.3 Task
The Slot-Filling task (Spoken Language Understanding) consists of assigning a label to each word of a given sentence. It is a classification task.
12.4 Dataset
An old and small benchmark for this task is the ATIS (Airline Travel Information System) dataset collected
by DARPA. Here is a sentence (or utterance) example using the Inside Outside Beginning (IOB) representation.
Input (words)    show  flights  from  Boston  to  New    York   today
Output (labels)  O     O        O     B-dept  O   B-arr  I-arr  B-date
The ATIS official split contains 4,978/893 sentences for a total of 56,590/9,198 words (average sentence
length is 15) in the train/test set. The number of classes (different slots) is 128 including the O label (NULL).
Following the approach of Microsoft Research, we deal with unseen words in the test set by marking any word
with only a single occurrence in the training set as <UNK>, and we use this token to represent those unseen
words in the test set. Following Ronan Collobert and colleagues, we replace each digit in a sequence of
numbers with the string DIGIT, e.g. 1984 is converted to DIGITDIGITDIGITDIGIT.
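The two preprocessing rules can be sketched as follows. This is an illustrative reimplementation, not the script used to build the dataset: the helper names are hypothetical, and the choice to rewrite only tokens made entirely of digits is an assumption.

```python
from collections import Counter


def digits_to_token(word):
    """Replace each digit of an all-digit token with the string DIGIT."""
    # Assumption: only tokens made entirely of digits are rewritten.
    if word.isdigit():
        return "DIGIT" * len(word)
    return word


def build_vocab(train_sentences):
    """Keep the words that occur more than once in the training set."""
    counts = Counter(digits_to_token(w)
                     for sent in train_sentences for w in sent)
    return set(w for w, c in counts.items() if c > 1)


def preprocess(sentence, vocab):
    """Map digits to DIGIT tokens and out-of-vocabulary words to <UNK>."""
    words = [digits_to_token(w) for w in sentence]
    return [w if w in vocab else "<UNK>" for w in words]


train = [["flights", "from", "boston", "in", "1984"],
         ["flights", "from", "denver", "in", "1999"]]
vocab = build_vocab(train)
print(preprocess(["flights", "to", "boston", "in", "2001"], vocab))
```

In this toy corpus "boston" occurs only once in training, so it is mapped to <UNK> just like the genuinely unseen word "to", while any four-digit year shares a single DIGITDIGITDIGITDIGIT token.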
We split the official train set into a training and validation set that contain respectively 80% and 20% of the
official training sentences. Due to the small size of the dataset, a performance difference has to be greater
than 0.6% in F1 measure to be significant at the 95% level. For evaluation purposes, experiments have
to report the following metrics:
Precision
Recall
F1 score
We will use the conlleval PERL script to measure the performance of our models.
def contextwin(l, win):
    assert (win % 2) == 1
    assert win >= 1
    l = list(l)

    lpadded = win // 2 * [-1] + l + win // 2 * [-1]
    out = [lpadded[i:(i + win)] for i in range(len(l))]

    assert len(out) == len(l)
    return out
The index -1 corresponds to the PADDING index we insert at the beginning/end of the sentence.
Here is a sample:
>>> x
array([0, 1, 2, 3, 4], dtype=int32)
>>> contextwin(x, 3)
[[-1, 0, 1],
[ 0, 1, 2],
[ 1, 2, 3],
[ 2, 3, 4],
[ 3, 4,-1]]
>>> contextwin(x, 7)
[[-1, -1, -1, 0, 1, 2, 3],
[-1, -1, 0, 1, 2, 3, 4],
[-1, 0, 1, 2, 3, 4,-1],
[ 0, 1, 2, 3, 4,-1,-1],
[ 1, 2, 3, 4,-1,-1,-1]]
To summarize, we started with an array of indexes and ended with a matrix of indexes. Each line corresponds
to the context window surrounding this word.
idxs = T.imatrix()  # as many columns as words in the context window and as many lines as words in the sentence
x = self.emb[idxs].reshape((idxs.shape[0], de*cs))
The x symbolic variable corresponds to a matrix of shape (number of words in the sentences, dimension of
the embedding space X context window size).
Let's compile a Theano function to do so:
>>> sample
array([0, 1, 2, 3, 4], dtype=int32)
>>> csample = contextwin(sample, 7)
[[-1, -1, -1, 0, 1, 2, 3],
[-1, -1, 0, 1, 2, 3, 4],
[-1, 0, 1, 2, 3, 4,-1],
[ 0, 1, 2, 3, 4,-1,-1],
[ 1, 2, 3, 4,-1,-1,-1]]
>>> f = theano.function(inputs=[idxs], outputs=x)
>>> f(csample)
array([[-0.08088442, 0.08458307, 0.05064092, ..., 0.06876887,
-0.06648078, -0.15192257],
[-0.08088442, 0.08458307, 0.05064092, ..., 0.11192625,
0.08745284, 0.04381778],
[-0.08088442, 0.08458307, 0.05064092, ..., -0.00937143,
0.10804889, 0.1247109 ],
[ 0.11038255, -0.10563177, -0.18760249, ..., -0.00937143,
0.10804889, 0.1247109 ],
[ 0.18738101, 0.14727569, -0.069544 , ..., -0.00937143,
0.10804889, 0.1247109 ]], dtype=float32)
>>> f(csample).shape
(5, 350)
We now have a sequence (of length 5, which corresponds to the length of the sentence) of context window
word embeddings, which is easy to feed to a simple recurrent neural network.
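The lookup-and-reshape step has a direct NumPy analogue, which makes the (5, 350) shape easy to verify. In this sketch the sizes are illustrative (an embedding dimension of 50 is inferred from 350 / 7), and the extra final row of the embedding matrix, which makes index -1 select a dedicated PADDING vector, is an assumption about how the matrix is allocated:

```python
import numpy as np


def contextwin(l, win):
    """Context windows of odd size win around each element, padded with -1."""
    l = list(l)
    lpadded = win // 2 * [-1] + l + win // 2 * [-1]
    return [lpadded[i:(i + win)] for i in range(len(l))]


ne, de, cs = 5, 50, 7       # vocabulary size, embedding dim, window size
rng = np.random.RandomState(0)
# Assumption: one extra row at the end, so that index -1 selects a
# dedicated PADDING embedding rather than a real word.
emb = 0.2 * rng.uniform(-1.0, 1.0, (ne + 1, de))

idxs = np.asarray(contextwin([0, 1, 2, 3, 4], cs), dtype="int32")
x = emb[idxs].reshape((idxs.shape[0], de * cs))
print(x.shape)
```

Each row of x is the concatenation of the cs embeddings in one context window, so the first de entries of row 0 are the padding vector.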
Then we integrate the way to build the input from the embedding matrix:
idxs = T.imatrix()
x = self.emb[idxs].reshape((idxs.shape[0], de*cs))
y_sentence = T.ivector('y_sentence')  # labels
We use the scan operator to construct the recursion:
def recurrence(x_t, h_tm1):
    h_t = T.nnet.sigmoid(T.dot(x_t, self.wx)
                         + T.dot(h_tm1, self.wh) + self.bh)
    s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
    return [h_t, s_t]

[h, s], _ = theano.scan(fn=recurrence,
                        sequences=x,
                        outputs_info=[self.h0, None],
                        n_steps=x.shape[0])

p_y_given_x_sentence = s[:, 0, :]
y_pred = T.argmax(p_y_given_x_sentence, axis=1)
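For readers less familiar with scan, the recursion corresponds to an ordinary loop over the words of the sentence. The NumPy sketch below mirrors the computation with randomly initialized weights and illustrative sizes; it is not the tutorial's model class. In the Theano version, the softmax op appears to return a matrix even for a single vector, which is presumably why the extra middle axis is indexed away with s[:, 0, :].

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()


def rnn_forward(x, wx, wh, w, bh, b, h0):
    """Return per-word class probabilities, shape (n_words, n_classes)."""
    h_t = h0
    probs = []
    for x_t in x:               # one context-window embedding per step
        h_t = sigmoid(x_t.dot(wx) + h_t.dot(wh) + bh)
        probs.append(softmax(h_t.dot(w) + b))
    return np.array(probs)


rng = np.random.RandomState(0)
n_words, n_in, nh, nc = 5, 350, 10, 4       # illustrative sizes
x = rng.uniform(-1.0, 1.0, (n_words, n_in))
wx = 0.2 * rng.uniform(-1.0, 1.0, (n_in, nh))
wh = 0.2 * rng.uniform(-1.0, 1.0, (nh, nh))
w = 0.2 * rng.uniform(-1.0, 1.0, (nh, nc))
bh, b, h0 = np.zeros(nh), np.zeros(nc), np.zeros(nh)

p_y_given_x_sentence = rnn_forward(x, wx, wh, w, bh, b, h0)
y_pred = p_y_given_x_sentence.argmax(axis=1)
print(p_y_given_x_sentence.shape, y_pred)
```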
Theano will then compute all the gradients automatically to maximize the log-likelihood:
lr = T.scalar('lr')

sentence_nll = -T.mean(T.log(p_y_given_x_sentence)
                       [T.arange(x.shape[0]), y_sentence])
sentence_gradients = T.grad(sentence_nll, self.params)
sentence_updates = OrderedDict((p, p - lr*g)
                               for p, g in
                               zip(self.params, sentence_gradients))
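The cost and the update rule are simple enough to restate in plain Python. The sketch below uses hand-picked probabilities and gradients (no automatic differentiation) purely to show what is being averaged and how each parameter moves; the helper names are hypothetical.

```python
import math


def sentence_nll(probs, labels):
    """Mean negative log-probability assigned to the true label of each word."""
    log_likelihoods = [math.log(p[y]) for p, y in zip(probs, labels)]
    return -sum(log_likelihoods) / len(log_likelihoods)


def sgd_step(params, grads, lr):
    """Plain gradient descent update p <- p - lr * g for each parameter."""
    return [p - lr * g for p, g in zip(params, grads)]


probs = [[0.5, 0.5], [0.25, 0.75]]      # two words, two classes
labels = [0, 1]
nll = sentence_nll(probs, labels)
print(nll)                               # -(log 0.5 + log 0.75) / 2

new_params = sgd_step([1.0, -2.0], [0.5, -0.5], lr=0.1)
print(new_params)
```

Minimizing this negative log-likelihood is the same as maximizing the log-likelihood mentioned above.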
We keep the word embeddings on the unit sphere by normalizing them after each update:
self.normalize = theano.function(inputs=[],
                                 updates={self.emb:
                                          self.emb /
                                          T.sqrt((self.emb**2)
                                          .sum(axis=1))
                                          .dimshuffle(0, 'x')})
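The normalization divides every row of the embedding matrix by its Euclidean norm. A NumPy analogue with a random toy matrix (the sizes are illustrative) makes the broadcasting explicit:

```python
import numpy as np

rng = np.random.RandomState(0)
emb = rng.uniform(-1.0, 1.0, (6, 50))

row_norms = np.sqrt((emb ** 2).sum(axis=1))     # shape (6,)
# Analogue of dimshuffle(0, 'x'): turn the norms into a column so that
# each row of emb is divided by its own norm.
emb_normalized = emb / row_norms[:, np.newaxis]
new_norms = np.sqrt((emb_normalized ** 2).sum(axis=1))
print(new_norms)
```

After the division every row has unit length, which is what keeping the embeddings on the unit sphere means here.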
12.6 Evaluation
With the previously defined functions, you can compare the predicted labels with the true labels and compute
some metrics. In this repo, we build a wrapper around the conlleval PERL script. It is not trivial to compute
those metrics due to the Inside Outside Beginning (IOB) representation, i.e. a prediction is considered correct
only if the word-beginning, the word-inside and the word-outside predictions are all correct. Note that the
extension is txt and you will have to change it to pl.
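To make the chunk-level scoring concrete, here is a simplified plain-Python sketch. It is not the conlleval script and may differ from it on edge cases: the helper names are hypothetical, and the rule that an I- label without a matching preceding label opens a new chunk is an assumption of this sketch.

```python
def iob_chunks(labels):
    """Return the set of (start, end, slot) chunks; end is exclusive."""
    chunks = set()
    start, slot = None, None
    # A trailing "O" sentinel closes a chunk that runs to the end.
    for i, label in enumerate(list(labels) + ["O"]):
        continues = label.startswith("I-") and label[2:] == slot
        if continues:
            continue
        if slot is not None:
            chunks.add((start, i, slot))
        if label == "O":
            start, slot = None, None
        else:                   # B-x, or an I-x that opens a new chunk
            start, slot = i, label[2:]
    return chunks


def precision_recall_f1(true_labels, pred_labels):
    """Chunk-level scores: a chunk counts only if span and slot both match."""
    true_chunks = iob_chunks(true_labels)
    pred_chunks = iob_chunks(pred_labels)
    n_correct = len(true_chunks & pred_chunks)
    precision = n_correct / float(len(pred_chunks)) if pred_chunks else 0.0
    recall = n_correct / float(len(true_chunks)) if true_chunks else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)


truth = ["O", "O", "O", "B-dept", "O", "B-arr", "I-arr", "B-date"]
pred = ["O", "O", "O", "B-dept", "O", "B-arr", "O", "B-date"]
print(sorted(iob_chunks(truth)))
print(precision_recall_f1(truth, pred))
```

In the example the prediction gets "New" right but misses "York", so the whole arrival chunk counts as wrong even though seven of the eight word labels are correct.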
12.7 Training
12.7.1 Updates
For stochastic gradient descent (SGD) update, we consider the whole sentence as a mini-batch and perform
one update per sentence. It is possible to perform a pure SGD (contrary to mini-batch) where the update is
done on only one single word at a time.
After each iteration/update, we normalize the word embeddings to keep them on a unit sphere.
(NEW BEST: epoch, 25, valid F1, 96.84, best test F1, 93.79)
[learning] epoch 26 >> 100.00% completed in 28.76 (sec) <<
[learning] epoch 27 >> 100.00% completed in 28.76 (sec) <<
...
(BEST RESULT: epoch, 57, valid F1, 97.23, best test F1, 94.2, with the model, rnns
12.8.1 Timing
Running experiments on ATIS using this repository will run one epoch in less than 40 seconds on an i7 CPU
950 @ 3.07GHz using less than 200 MB of RAM:
F1 94.19
F1 94.42
35.04 (sec) <<
34.80 (sec) <<
F1 94.34
35.18 (sec) <<
F1 94.48
35.39 (sec) <<
35.31 (sec) <<
back     ap80         but           aircraft   business  a             august
live     ap57         if            plane      coach     people        september
lives    ap           up            service    first     do            january
both     connections  a             airplane   fourth    but           june
how      tomorrow     now           seating    thrift    numbers       december
me       before       amount        stand      tenth     abbreviation  november
out      earliest     more          that       second    if            april
other    connect      abbreviation  on         fifth     up            july
plane    thrift       restrictions  turboprop  third     serve         jfk
service  coach        mean          mean       twelfth   database      october
fare     today        interested    amount     sixth     passengers    may
As you can see, the limited size of the vocabulary (about 500 words) gives mixed results.
Judged by a human reader, some neighbors are good and some are bad.
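A nearest-neighbor lookup of this kind can be sketched as follows. The sketch assumes cosine similarity as the distance measure (for unit-norm embeddings this gives the same ranking as Euclidean distance); the function name, the word list and the two-dimensional vectors are hypothetical toy values, not the learned ATIS embeddings.

```python
import numpy as np

def nearest_neighbors(emb, words, query, k=3):
    """Return the k words whose embeddings are most similar to `query`."""
    unit = emb / np.sqrt((emb ** 2).sum(axis=1))[:, np.newaxis]
    sims = unit.dot(unit[words.index(query)])      # cosine similarity to the query
    order = np.argsort(-sims)                      # most similar first
    return [words[i] for i in order if words[i] != query][:k]

words = ["august", "september", "coach", "business"]
emb = np.array([[1.0, 0.1],
                [0.9, 0.2],
                [0.1, 1.0],
                [0.2, 0.9]])                       # toy vectors, for illustration only
neighbors = nearest_neighbors(emb, words, "august", k=1)   # ['september']
```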
actually: provide, prices, stop, number, flight, there, serving, thank, ticket, are
ch: w, w, am, ea, sf, m, jfk, sh, bw, la
CHAPTER
THIRTEEN
13.1 Summary
This tutorial aims to provide an example of how a Recurrent Neural Network (RNN) using the Long
Short-Term Memory (LSTM) architecture can be implemented using Theano. In this tutorial, this model is used
to perform sentiment analysis on movie reviews from the Large Movie Review Dataset, sometimes known
as the IMDB dataset.
In this task, given a movie review, the model attempts to predict whether it is positive or negative. This is a
binary classification task.
13.2 Data
As previously mentioned, the provided scripts are used to train an LSTM recurrent neural network on the
Large Movie Review Dataset.
While the dataset is public, in this tutorial we provide a copy of the dataset that has previously been preprocessed according to the needs of this LSTM implementation. Running the code provided in this tutorial
will automatically download the data to the local directory. In order to use your own data, please use the
preprocessing script provided as a part of this tutorial.
Once the model is trained, you can test it with your own corpus using the word-index dictionary
(imdb.dict.pkl.gz) provided as a part of this tutorial.
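As a rough sketch of how a word-index dictionary can be applied to new text, the function below maps tokens to integer indices. The whitespace tokenization, the toy dictionary and the index reserved for unknown words (unk_index=1) are all assumptions made for illustration; check the preprocessing script for the conventions actually used.

```python
def encode(text, word_index, unk_index=1):
    """Map a whitespace-tokenized, lower-cased text to a list of word indices."""
    return [word_index.get(token, unk_index) for token in text.lower().split()]

# Hypothetical stand-in for the dictionary loaded from imdb.dict.pkl.gz.
toy_index = {"the": 2, "movie": 3, "was": 4, "great": 5}
encoded = encode("The movie was surprisingly great", toy_index)   # [2, 3, 4, 1, 5]
```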
13.3 Model
13.3.1 LSTM
In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal
can end up being multiplied a large number of times (as many as the number of timesteps) by the weight
matrix associated with the connections between the neurons of the recurrent hidden layer. This means that
the magnitude of the weights in the transition matrix can have a strong impact on the learning process.
If the weights in this matrix are small (or, more formally, if the leading eigenvalue of the weight matrix
is smaller than 1.0), it can lead to a situation called vanishing gradients where the gradient signal gets so
small that learning either becomes very slow or stops working altogether. It can also make the task of
learning long-term dependencies in the data more difficult. Conversely, if the weights in this matrix are large
(or, again, more formally, if the leading eigenvalue of the weight matrix is larger than 1.0), it can lead to a
situation where the gradient signal is so large that it can cause learning to diverge. This is often referred to
as exploding gradients.
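The effect of repeated multiplication by the transition matrix can be seen in a small numerical sketch. It ignores the derivative of the non-linearity and uses scaled identity matrices as hypothetical transition matrices, so it only illustrates the role of the leading eigenvalue.

```python
import numpy as np

def grad_norm_after(W, steps):
    """Norm of a gradient signal after `steps` multiplications by W transposed."""
    g = np.ones(W.shape[0])
    for _ in range(steps):
        g = W.T.dot(g)
    return np.linalg.norm(g)

small = 0.5 * np.eye(3)    # leading eigenvalue 0.5 (smaller than 1.0)
large = 1.5 * np.eye(3)    # leading eigenvalue 1.5 (larger than 1.0)
vanishing = grad_norm_after(small, 50)   # shrinks towards zero
exploding = grad_norm_after(large, 50)   # grows very large
```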
These issues are the main motivation behind the LSTM model, which introduces a new structure called a
memory cell (see Figure 1 below). A memory cell is composed of four main elements: an input gate, a
neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. The
self-recurrent connection has a weight of 1.0 and ensures that, barring any outside interference, the state of a
memory cell can remain constant from one timestep to another. The gates serve to modulate the interactions
between the memory cell itself and its environment. The input gate can allow an incoming signal to alter the
state of the memory cell or block it. On the other hand, the output gate can allow the state of the memory
cell to have an effect on other neurons or prevent it. Finally, the forget gate can modulate the memory cell's
self-recurrent connection, allowing the cell to remember or forget its previous state, as needed.
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)    (13.1)
\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)    (13.2)
Second, we compute the value for f_t, the activation of the memory cells' forget gates at time t:
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)    (13.3)
Given the value of the input gate activation i_t, the forget gate activation f_t and the candidate state value
\tilde{C}_t, we can compute C_t, the memory cells' new state at time t:
C_t = i_t * \tilde{C}_t + f_t * C_{t-1}    (13.4)
With the new state of the memory cells, we can compute the value of their output gates and, subsequently,
their outputs:
o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)    (13.5)
h_t = o_t * \tanh(C_t)    (13.6)
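In the variant used in this tutorial, the activation of the output gate does not depend on the memory cell's
state C_t, so equation (13.5) is replaced by:
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)    (13.7)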
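Equations (13.1) to (13.6) can be sketched as a single NumPy update step. This is a minimal illustration under stated assumptions, not the tutorial's Theano implementation: the parameter dictionary p, its key names and the toy sizes are hypothetical, and the products written with * are element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One timestep of a layer of memory cells, following (13.1) to (13.6)."""
    i_t = sigmoid(p["W_i"].dot(x_t) + p["U_i"].dot(h_prev) + p["b_i"])       # (13.1)
    C_tilde = np.tanh(p["W_c"].dot(x_t) + p["U_c"].dot(h_prev) + p["b_c"])   # (13.2)
    f_t = sigmoid(p["W_f"].dot(x_t) + p["U_f"].dot(h_prev) + p["b_f"])       # (13.3)
    C_t = i_t * C_tilde + f_t * C_prev                                       # (13.4)
    o_t = sigmoid(p["W_o"].dot(x_t) + p["U_o"].dot(h_prev)
                  + p["V_o"].dot(C_t) + p["b_o"])                            # (13.5)
    h_t = o_t * np.tanh(C_t)                                                 # (13.6)
    return h_t, C_t

rng = np.random.RandomState(0)
n_in, n_hid = 4, 3                          # toy sizes, chosen for illustration
p = {}
for gate in "ifco":
    p["W_" + gate] = rng.randn(n_hid, n_in) * 0.1
    p["U_" + gate] = rng.randn(n_hid, n_hid) * 0.1
    p["b_" + gate] = np.zeros(n_hid)
p["V_o"] = rng.randn(n_hid, n_hid) * 0.1
h_t, C_t = lstm_step(rng.randn(n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```

Dropping the V_o term in the line marked (13.5) gives the output gate of equation (13.7).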
Our model is composed of a single LSTM layer followed by an average pooling and a logistic regression
layer as illustrated in Figure 2 below. Thus, from an input sequence x_0, x_1, x_2, ..., x_n, the memory cells in
the LSTM layer will produce a representation sequence h_0, h_1, h_2, ..., h_n. This representation sequence is
then averaged over all timesteps resulting in representation h. Finally, this representation is fed to a logistic
regression layer whose target is the class label associated with the input sequence.
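The pooling and classification stage can be sketched as follows. The sketch handles a single sequence and ignores the padding masks that a mini-batch implementation would need; the names predict_proba, U_out and b_out and the toy sizes are assumptions made for the example.

```python
import numpy as np

def predict_proba(H, U_out, b_out):
    """Mean-pool the hidden states over time, then apply a softmax layer."""
    h_mean = H.mean(axis=0)                 # average over timesteps, shape (n_hid,)
    z = h_mean.dot(U_out) + b_out           # one score per class
    e = np.exp(z - z.max())                 # subtract max for numerical stability
    return e / e.sum()

rng = np.random.RandomState(0)
H = np.tanh(rng.randn(7, 3))                # 7 timesteps of a 3-unit LSTM layer (toy values)
U_out, b_out = rng.randn(3, 2) * 0.1, np.zeros(2)
proba = predict_proba(H, U_out, b_out)      # probabilities of the two classes
```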
Implementation note: In the code included in this tutorial, the equations (13.1), (13.2), (13.3) and (13.7)
are performed in parallel to make the computation more efficient. This is possible because none of these
equations rely on a result produced by the other ones. It is achieved by concatenating the four matrices W_*
into a single weight matrix W and performing the same concatenation on the weight matrices U_* to produce
the matrix U and the bias vectors b_* to produce the vector b. Then, the pre-nonlinearity activations can be
computed with:
z = W x_t + U h_{t-1} + b
The result is then sliced to obtain the pre-nonlinearity activations for i, f, \tilde{C}_t and o, and the
non-linearities are then applied independently for each.
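The concatenation trick can be checked numerically. The sketch below stacks four hypothetical per-gate parameter sets, computes z once, and verifies that each slice equals the activation computed separately; the helper slice_gate and the toy sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hid = 4, 3
names = ["i", "f", "c", "o"]                       # input gate, forget gate, candidate, output gate
W = dict((k, rng.randn(n_hid, n_in)) for k in names)
U = dict((k, rng.randn(n_hid, n_hid)) for k in names)
b = dict((k, rng.randn(n_hid)) for k in names)

# Stack the four parameter sets along the output dimension.
W_cat = np.concatenate([W[k] for k in names], axis=0)   # (4 * n_hid, n_in)
U_cat = np.concatenate([U[k] for k in names], axis=0)   # (4 * n_hid, n_hid)
b_cat = np.concatenate([b[k] for k in names])           # (4 * n_hid,)

x_t, h_prev = rng.randn(n_in), rng.randn(n_hid)
z = W_cat.dot(x_t) + U_cat.dot(h_prev) + b_cat          # all pre-activations at once

def slice_gate(z, n, dim):
    """Return the n-th block of size `dim` from the stacked pre-activations."""
    return z[n * dim:(n + 1) * dim]

separate = dict((k, W[k].dot(x_t) + U[k].dot(h_prev) + b[k]) for k in names)
matches = all(np.allclose(slice_gate(z, n, n_hid), separate[k])
              for n, k in enumerate(names))             # True
```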
Figure 13.2: Figure 2 : Illustration of the model used in this tutorial. It is composed of a single LSTM layer
followed by mean pooling over time and logistic regression.
The script will automatically download the data and decompress it.
Note: The provided code supports the Stochastic Gradient Descent (SGD), AdaDelta and RMSProp optimization methods. You are advised to use AdaDelta or RMSProp because SGD appears to perform poorly
on this task with this particular model.
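For readers unfamiliar with these optimizers, here is a sketch of a basic RMSProp update, which scales each gradient component by a running average of its recent magnitude. This is a generic textbook form with assumed hyperparameter values; the variant implemented in the tutorial's code may differ.

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=0.01, rho=0.9, eps=1e-6):
    """One RMSProp update; `cache` is the running average of squared gradients."""
    cache = rho * cache + (1.0 - rho) * grad ** 2
    param = param - lr * grad / np.sqrt(cache + eps)
    return param, cache

# Toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w, cache = np.array([5.0, -3.0]), np.zeros(2)
start_norm = np.linalg.norm(w)
for _ in range(100):
    w, cache = rmsprop_step(w, w, cache)
final_norm = np.linalg.norm(w)              # smaller than start_norm
```

Because of the normalization, both coordinates move at a similar pace even though their gradients differ in magnitude.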
13.4.2 Papers
If you use this tutorial, please cite the following papers.
Introduction of the LSTM model:
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8),
1735-1780.
Addition of the forget gate to the LSTM model:
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction
with LSTM. Neural computation, 12(10), 2451-2471.
More recent LSTM paper:
Graves, Alex. Supervised sequence labelling with recurrent neural networks. Vol. 385. Springer,
2012.
Papers related to Theano:
Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and
GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference
(SciPy), June 2010.
Thank you!
13.4.3 Contact
Please email Pierre Luc Carrier or Kyunghyun Cho with any problem reports or feedback. We will be glad to
hear from you.
13.5 References
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with
LSTM. Neural computation, 12(10), 2451-2471.
Graves, A. (2012). Supervised sequence labelling with recurrent neural networks (Vol. 385). Springer.
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets:
the difficulty of learning long-term dependencies.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent
is difficult. Neural Networks, IEEE Transactions on, 5(2), 157-166.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning
word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 142-150). Association
for Computational Linguistics.
CHAPTER
FOURTEEN
Note: This tutorial demonstrates a basic implementation of the RNN-RBM as described in [BoulangerLewandowski12]. We assume the reader is familiar with recurrent neural networks using the scan op
and restricted Boltzmann machines (RBM).
Note: The code for this section is available for download here: rnnrbm.py.
You will need the modified Python MIDI package (GPL license) in your $PYTHONPATH or in the working
directory in order to convert MIDI files to and from piano-rolls. The script also assumes that the content
of the Nottingham Database of folk tunes has been extracted in the ../data directory. Alternative MIDI
datasets are available here.
Note that both dependencies above can be set up automatically by running the download.sh script in the
../data directory.
Caution: This code requires Theano 0.6 or more recent.
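b_v^{(t)} = b_v + W_{uv} u^{(t-1)}    (14.1)
b_h^{(t)} = b_h + W_{uh} u^{(t-1)}    (14.2)
u^{(t)} = \tanh(b_u + W_{uu} u^{(t-1)} + W_{vu} v^{(t)})    (14.3)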
The overall probability distribution is given by the product over the T time steps in a given sequence:
P(\{v^{(t)}\}) = \prod_{t=1}^{T} P(v^{(t)} | A^{(t)})    (14.4)
where the right-hand side multiplicand is the marginalized probability of the t-th RBM.
Note that for clarity of the implementation, contrary to [BoulangerLewandowski12], we use the obvious
naming convention for weight matrices and we use u^{(t)} instead of \hat{h}^{(t)} for the recurrent hidden units.
14.2 Implementation
We wish to construct two Theano functions: one to train the RNN-RBM, and one to generate sample sequences from it.
For training, i.e. given {v^{(t)}}, the RNN hidden state {u^{(t)}} and the associated {b_v^{(t)}, b_h^{(t)}} parameters are
deterministic and can be readily computed for each training sequence. A stochastic gradient descent (SGD)
update on the parameters can then be estimated via contrastive divergence (CD) on the individual time steps
of a sequence in the same way that individual training examples are treated in a mini-batch for regular
RBMs.
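The deterministic part of training can be sketched in NumPy: given a sequence of visible vectors, the time-dependent RBM biases follow from the RNN recurrence. The sketch assumes the recurrence u^(t) = tanh(b_u + W_uu u^(t-1) + W_vu v^(t)) and biases of the form b + W u^(t-1) as in [BoulangerLewandowski12]; the parameter dictionary, its key names, the zero initial state and the toy sizes are assumptions, and the row-vector convention (v.dot(W)) is chosen for convenience.

```python
import numpy as np

def rnn_rbm_biases(v_seq, p, u0):
    """Compute the RBM biases b_v^(t), b_h^(t) for every time step of v_seq."""
    u_prev, bv_seq, bh_seq = u0, [], []
    for v_t in v_seq:
        bv_seq.append(p["bv"] + u_prev.dot(p["Wuv"]))    # visible bias at time t
        bh_seq.append(p["bh"] + u_prev.dot(p["Wuh"]))    # hidden bias at time t
        # RNN recurrence: the new state depends on v_t and the previous state.
        u_prev = np.tanh(p["bu"] + v_t.dot(p["Wvu"]) + u_prev.dot(p["Wuu"]))
    return np.array(bv_seq), np.array(bh_seq)

rng = np.random.RandomState(0)
n_visible, n_hidden, n_recurrent = 4, 3, 2               # toy sizes
p = {"bv": np.zeros(n_visible), "bh": np.zeros(n_hidden), "bu": np.zeros(n_recurrent),
     "Wuv": rng.randn(n_recurrent, n_visible) * 0.1,
     "Wuh": rng.randn(n_recurrent, n_hidden) * 0.1,
     "Wvu": rng.randn(n_visible, n_recurrent) * 0.1,
     "Wuu": rng.randn(n_recurrent, n_recurrent) * 0.1}
v_seq = rng.binomial(1, 0.3, size=(6, n_visible))        # toy binary piano-roll
bv_seq, bh_seq = rnn_rbm_biases(v_seq, p, np.zeros(n_recurrent))
```

Because all biases of a sequence are available at once, the time steps can then be treated like the examples of a mini-batch.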
Sequence generation is similar except that the v^{(t)} must be sampled sequentially at each time step with a
separate (non-batch) Gibbs chain before being passed down to the recurrence and the sequence history.
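Sequential generation can be sketched as a loop that runs a short Gibbs chain at each time step and feeds the sample back into the recurrence. This is an illustrative NumPy sketch, not the tutorial's scan-based code: the all-zero chain start, the zero initial RNN state, the number of Gibbs steps k, the parameter names and the toy sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sample(v, W, bv, bh, k, rng):
    """Run k steps of block Gibbs sampling in an RBM, starting from v."""
    for _ in range(k):
        h = rng.binomial(1, sigmoid(v.dot(W) + bh))      # sample hidden units
        v = rng.binomial(1, sigmoid(h.dot(W.T) + bv))    # sample visible units
    return v

def generate(p, n_steps, k, rng):
    """Sample a sequence one time step at a time."""
    n_visible = p["bv"].shape[0]
    u = np.zeros(p["bu"].shape[0])                       # initial RNN state (assumed zero)
    sequence = []
    for _ in range(n_steps):
        bv_t = p["bv"] + u.dot(p["Wuv"])
        bh_t = p["bh"] + u.dot(p["Wuh"])
        v_t = gibbs_sample(np.zeros(n_visible), p["W"], bv_t, bh_t, k, rng)
        u = np.tanh(p["bu"] + v_t.dot(p["Wvu"]) + u.dot(p["Wuu"]))   # pass sample to the RNN
        sequence.append(v_t)
    return np.array(sequence)

rng = np.random.RandomState(0)
n_visible, n_hidden, n_recurrent = 4, 3, 2               # toy sizes
p = {"W": rng.randn(n_visible, n_hidden) * 0.1,
     "bv": np.zeros(n_visible), "bh": np.zeros(n_hidden), "bu": np.zeros(n_recurrent),
     "Wuv": rng.randn(n_recurrent, n_visible) * 0.1,
     "Wuh": rng.randn(n_recurrent, n_hidden) * 0.1,
     "Wvu": rng.randn(n_visible, n_recurrent) * 0.1,
     "Wuu": rng.randn(n_recurrent, n_recurrent) * 0.1}
piano_roll = generate(p, n_steps=5, k=10, rng=rng)       # binary array, one row per time step
```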
Chapter 14. Modeling and generating sequences of polyphonic music with the RNN-RBM
        dt=0.3
    ):
        '''Constructs and compiles Theano functions for training and sequence
        generation.

        n_hidden : integer
            Number of hidden units of the conditional RBMs.
        n_hidden_recurrent : integer
            Number of hidden units of the RNN.
        lr : float
            Learning rate
        r : (integer, integer) tuple
            Specifies the pitch range of the piano-roll in MIDI note numbers,
            including r[0] but not r[1], such that r[1]-r[0] is the number of
            visible units of the RBM at a given time step. The default (21,
            109) corresponds to the full range of piano (88 notes).
        dt : float
            Sampling period when converting the MIDI files into piano-rolls, or
            equivalently the time difference between consecutive time steps.'''

        self.r = r
        self.dt = dt
        (v, v_sample, cost, monitor, params, updates_train, v_t,
            updates_generate) = build_rnnrbm(
                r[1] - r[0],
                n_hidden,
                n_hidden_recurrent
            )

        gradient = T.grad(cost, params, consider_constant=[v_sample])
        updates_train.update(
            ((p, p - lr * g) for p, g in zip(params, gradient))
        )
        self.train_function = theano.function(
            [v],
            monitor,
            updates=updates_train
        )
        self.generate_function = theano.function(
            [],
            v_t,
            updates=updates_generate
        )
    def train(self, files, batch_size=100, num_epochs=200):
        '''Train the RNN-RBM via stochastic gradient descent (SGD) using MIDI
        files converted to piano-rolls.

        files : list of strings
            List of MIDI files that will be loaded as piano-rolls for training.
        batch_size : integer
            Training sequences will be split into subsequences of at most this
            size before applying the SGD updates.
        num_epochs : integer
            Number of epochs (pass over the training set) performed. The user
            can safely interrupt training with Ctrl+C at any time.'''

        assert len(files) > 0, 'Training set is empty!' \
                               ' (did you download the data files?)'
        dataset = [midiread(f, self.r,
                            self.dt).piano_roll.astype(theano.config.floatX)
                   for f in files]

        try:
            for epoch in xrange(num_epochs):
                numpy.random.shuffle(dataset)
                costs = []

                for s, sequence in enumerate(dataset):
                    for i in xrange(0, len(sequence), batch_size):
                        cost = self.train_function(sequence[i:i + batch_size])
                        costs.append(cost)

                print 'Epoch %i/%i' % (epoch + 1, num_epochs),
                print numpy.mean(costs)
                sys.stdout.flush()

        except KeyboardInterrupt:
            print 'Interrupted by user.'
    def generate(self, filename, show=True):
        '''Generate a sample sequence, plot the resulting piano-roll and save
        it as a MIDI file.

        filename : string
            A MIDI file will be created at this location.
        show : boolean
            If True, a piano-roll of the generated sequence will be shown.'''

        piano_roll = self.generate_function()
        midiwrite(filename, piano_roll, self.r, self.dt)
        if show:
            extent = (0, self.dt * len(piano_roll)) + self.r
            pylab.figure()
            pylab.imshow(piano_roll.T, origin='lower', aspect='auto',
                         interpolation='nearest', cmap=pylab.cm.gray_r,
                         extent=extent)
            pylab.xlabel('time (s)')
            pylab.ylabel('MIDI note number')
            pylab.title('generated piano-roll')
14.3 Results
We ran the code on the Nottingham database for 200 epochs; training took approximately 24 hours.
Epoch 1/200 -15.0308940028
Epoch 2/200 -10.4892606673
Epoch 3/200 -10.2394696138
Epoch 4/200 -10.1431669994
Epoch 5/200 -9.7005382843
Epoch 6/200 -8.5985647524
Epoch 7/200 -8.35115428534
Epoch 8/200 -8.26453580552
Epoch 9/200 -8.21208991542
Epoch 10/200 -8.16847274143
...
Epoch 190/200 -4.74799179994
Epoch 191/200 -4.73488515216
Epoch 192/200 -4.7326138489
Epoch 193/200 -4.73841636884
Epoch 194/200 -4.70255511452
Epoch 195/200 -4.71872634914
Epoch 196/200 -4.7276415885
Epoch 197/200 -4.73497644728
Epoch 198/200 -inf
Epoch 199/200 -4.75554987143
Epoch 200/200 -4.72591935412
The figures below show the piano-rolls of two sample sequences and we provide the corresponding MIDI
files:
CHAPTER
FIFTEEN
MISCELLANEOUS
        for i in xrange(4):
            if X[i] is None:
                # if channel is None, fill it with zeros of the correct
                # dtype
                out_array[:, :, i] = numpy.zeros(out_shape,
                        dtype='uint8' if output_pixel_vals else out_array.dtype
                        ) + channel_defaults[i]
            else:
                # use a recursive call to compute the channel and store it
                # in the output
                out_array[:, :, i] = tile_raster_images(X[i], img_shape, tile_shape, tile_spa
        return out_array

    else:
        # if we are dealing with only one channel
        H, W = img_shape
        Hs, Ws = tile_spacing

        # generate a matrix to store the output
        out_array = numpy.zeros(out_shape, dtype='uint8' if output_pixel_vals else X.dtype)
CHAPTER
SIXTEEN
REFERENCES
BIBLIOGRAPHY
[Alder59] Alder, B. J. and Wainwright, T. E. (1959) Studies in molecular dynamics. 1. General method,
Journal of Chemical Physics, vol. 31, pp. 459-466.
[Andersen80] Andersen, H.C. (1980) Molecular dynamics simulations at constant pressure and/or temperature, Journal of Chemical Physics, vol. 72, pp. 2384-2393.
[Duane87] Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987) Hybrid Monte Carlo,
Physics Letters, vol. 195, pp. 216-222.
[Neal93] Neal, R. M. (1993) Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 144 pages
[Bengio07] Y. Bengio, P. Lamblin, D. Popovici and H. Larochelle, Greedy Layer-Wise Training of Deep
Networks, in Advances in Neural Information Processing Systems 19 (NIPS06), pages 153-160,
MIT Press 2007.
[Bengio09] Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning
1(2) pages 1-127.
[BengioDelalleau09] Y. Bengio, O. Delalleau, Justifying and Generalizing Contrastive Divergence
(2009), Neural Computation, 21(6): 1601-1621.
[BoulangerLewandowski12] N Boulanger-Lewandowski, Y. Bengio and P. Vincent, Modeling Temporal
Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription, in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
[Fukushima] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193-202.
[Hinton06] G.E. Hinton and R.R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, 28 July 2006, Vol. 313. no. 5786, pp. 504 - 507.
[Hinton07] G.E. Hinton, S. Osindero, and Y. Teh, A fast learning algorithm for deep belief nets, Neural
Computation, vol 18, 2006
[Hubel68] Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate
cortex. Journal of Physiology (London), 195, 215-243.
[LeCun98] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[Lee08] H. Lee, C. Ekanadham, and A.Y. Ng., Sparse deep belief net model for visual area V2, in Advances in Neural Information Processing Systems (NIPS) 20, 2008.
[Lee09] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng, Convolutional deep belief networks for scalable
unsupervised learning of hierarchical representations., ICML 2009
[Ranzato10] M. Ranzato, A. Krizhevsky, G. Hinton, Factored 3-Way Restricted Boltzmann Machines for
Modeling Natural Images. Proc. of the 13-th International Conference on Artificial Intelligence
and Statistics (AISTATS 2010), Italy, 2010
[Ranzato07] M.A. Ranzato, C. Poultney, S. Chopra and Y. LeCun, in J. Platt et al., Efficient Learning
of Sparse Representations with an Energy-Based Model, Advances in Neural Information Processing
Systems (NIPS 2006), MIT Press, 2007.
[Serre07] Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., and Poggio, T. (2007). Robust object recognition with
cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3), 411-426.
[Vincent08] P. Vincent, H. Larochelle, Y. Bengio and P.A. Manzagol, Extracting and Composing Robust
Features with Denoising Autoencoders, Proceedings of the Twenty-fifth International Conference
on Machine Learning (ICML08), pages 1096 - 1103, ACM, 2008.
[Tieleman08] T. Tieleman, Training restricted boltzmann machines using approximations to the likelihood gradient, ICML 2008.
[Xavier10] Y. Bengio, X. Glorot, Understanding the difficulty of training deep feedforward neural networks, AISTATS 2010