Deep Learning: Autoencoders
Min-Te Sun, Ph.D.
Autoencoders
• Autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without any supervision (i.e., the training set is unlabeled).
• Applications of autoencoders
• the coding typically has a much lower dimensionality than the input data, making autoencoders useful for dimensionality reduction.
• autoencoders act as powerful feature detectors
• can be used for unsupervised pretraining of deep neural networks
• capable of randomly generating new data that looks very similar to the training data (i.e., a generative model)
Typical Autoencoders
• An autoencoder typically has the same architecture as a Multi-Layer
Perceptron, except that the number of neurons in the output layer must be
equal to the number of inputs.
• In a simple example with three inputs, there is just one hidden layer composed of two neurons (the encoder), and one output layer composed of three neurons (the decoder). The outputs are often called the reconstructions since the autoencoder tries to reconstruct the inputs, and the cost function contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs.
• Because the internal representation has a lower dimensionality than the
input data (it is 2D instead of 3D), the autoencoder is said to be
undercomplete. An undercomplete autoencoder cannot trivially copy its
inputs to the codings, yet it must find a way to output a copy of its inputs.
It is forced to learn the most important features in the input data (and drop
the unimportant ones).
Code Explanation
• Primary differences between autoencoder and MLP
• The number of outputs is equal to the number of inputs.
• To perform simple PCA, we set activation_fn=None (i.e., all neurons are linear)
and the cost function is the MSE.
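As an illustration, here is a minimal construction-phase sketch of such a linear autoencoder performing PCA, written in the TensorFlow 1.x style used throughout these slides (the 3D/2D layer sizes and the learning rate are illustrative assumptions, not the slide's exact code):

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

n_inputs = 3          # 3D inputs
n_hidden = 2          # 2D codings (the encoder)
n_outputs = n_inputs  # the decoder reconstructs the inputs
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = fully_connected(X, n_hidden, activation_fn=None)         # linear encoder
outputs = fully_connected(hidden, n_outputs, activation_fn=None)  # linear decoder

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))      # MSE
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)
init = tf.global_variables_initializer()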
Stacked Autoencoders
• Autoencoders can have multiple hidden layers. In this case they are called stacked
autoencoders (or deep autoencoders).
• Adding more layers helps the autoencoder learn more complex codings.
• However, one must be careful not to make the autoencoder too powerful. Imagine an
encoder so powerful that it just learns to map each input to a single arbitrary number (and
the decoder learns the reverse mapping). Obviously such an autoencoder will reconstruct the
training data perfectly, but it will not have learned any useful data representation in the
process (and it is unlikely to generalize well to new instances).
• The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer). To put it simply, it looks like a sandwich.
• For example, an autoencoder for MNIST (introduced in Chapter 3) may have 784 inputs,
followed by a hidden layer with 300 neurons, then a central hidden layer of 150 neurons,
then another hidden layer with 300 neurons, and an output layer with 784 neurons.
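A construction-phase sketch of this 784–300–150–300–784 architecture, again in TensorFlow 1.x style (the activation, initialization, and regularization details are simplified assumptions):

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

n_inputs = 28 * 28    # MNIST
n_hidden1 = 300
n_hidden2 = 150       # coding layer
n_hidden3 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden1 = fully_connected(X, n_hidden1)                            # ReLU by default
hidden2 = fully_connected(hidden1, n_hidden2)                      # codings
hidden3 = fully_connected(hidden2, n_hidden3)
outputs = fully_connected(hidden3, n_outputs, activation_fn=None)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))       # MSE
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)
init = tf.global_variables_initializer()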
Execution Phase
n_epochs = 5
batch_size = 150
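With these two hyperparameters, the execution phase is just a standard mini-batch training loop. The sketch below assumes the stacked autoencoder graph built above and the TF 1.x MNIST helper (the data path is an arbitrary placeholder):

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)  # labels are not used
            sess.run(training_op, feed_dict={X: X_batch})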
Tying Weights
• When an autoencoder is neatly symmetrical, like the one we just built, a common technique is to tie the weights of the decoder layers to the weights of the encoder layers.
• This halves the number of weights in the model, speeding up training and limiting the risk of overfitting. Specifically, if the autoencoder has a total of N layers (not counting the input layer), and W_L represents the connection weights of the Lth layer (e.g., layer 1 is the first hidden layer, layer N/2 is the coding layer, and layer N is the output layer), then the decoder layer weights can be defined simply as: W_{N−L+1} = W_L^T (with L = 1, 2, ⋯, N/2).
• Unfortunately, implementing tied weights in TensorFlow using the fully_connected() function is a bit cumbersome; it's actually easier to just define the layers manually.
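A sketch of how the layers might be defined manually with tied weights (the layer sizes, ELU activation, and ℓ2 regularization strength are assumptions carried over from the stacked autoencoder above); the variable names match the Code Explanation that follows:

activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(0.0001)           # regularization strength is illustrative
initializer = tf.contrib.layers.variance_scaling_initializer()   # He initialization

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])
weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.transpose(weights2, name="weights3")   # tied to weights2
weights4 = tf.transpose(weights1, name="weights4")   # tied to weights1

biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")

hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)   # codings
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
reg_loss = regularizer(weights1) + regularizer(weights2)       # only the true variables are regularized
loss = reconstruction_loss + reg_loss
training_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)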
Code Explanation
1. weights3 and weights4 are not variables, they are respectively the transpose of weights2 and weights1 (they are "tied" to them).
2. Since they are not variables, it's no use regularizing them: we only regularize weights1 and weights2.
3. Biases are never tied, and never regularized.
Code Explanation
• The first phase is rather straightforward: we just create an output layer that skips hidden layers 2 and 3, then build the training operations to minimize the distance between the outputs and the inputs (plus some regularization).
• The second phase just adds the operations needed to minimize the distance between the output of hidden layer 3 and hidden layer 1 (also with some regularization). Most importantly, we provide the list of trainable variables to the minimize() method, making sure to leave out weights1 and biases1; this effectively freezes hidden layer 1 during phase 2.
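A sketch of what these two sets of training operations might look like, assuming weights1–weights4 and biases1–biases4 are defined as independent variables for this example (no weight tying here) and that regularizer is the ℓ2 regularizer from before:

optimizer = tf.train.AdamOptimizer(learning_rate)

with tf.name_scope("phase1"):
    # Output layer that bypasses hidden layers 2 and 3 (works because n_hidden1 == n_hidden3)
    phase1_outputs = tf.matmul(hidden1, weights4) + biases4
    phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
    phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
    phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
    phase1_training_op = optimizer.minimize(phase1_loss)

with tf.name_scope("phase2"):
    # Train hidden layers 2 and 3 to reproduce the output of hidden layer 1
    phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
    phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
    phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
    train_vars = [weights2, biases2, weights3, biases3]        # leave out weights1 and biases1
    phase2_training_op = optimizer.minimize(phase2_loss, var_list=train_vars)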
Execution Phase
• During the execution phase, all you need to do is run the phase 1 training op for a number of epochs, then the phase 2 training op for some more epochs.
• Since hidden layer 1 is frozen during phase 2, its output will always be the same for any given training instance. To avoid having to recompute the output of hidden layer 1 at every single epoch, you can compute it for the whole training set at the end of phase 1, then directly feed the cached output of hidden layer 1 during phase 2. This can give you a nice performance boost.
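A sketch of this caching trick, assuming the graph above, the MNIST helper from earlier, the same n_epochs and batch_size, and init = tf.global_variables_initializer() (in TF 1.x any tensor, not just a placeholder, can be fed through feed_dict):

import numpy as np

with tf.Session() as sess:
    init.run()
    # Phase 1: train hidden layer 1 and the output layer
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(phase1_training_op, feed_dict={X: X_batch})
    # Hidden layer 1 is frozen from now on, so cache its output for the whole training set
    hidden1_cache = hidden1.eval(feed_dict={X: mnist.train.images})
    # Phase 2: feed the cached codings directly into hidden1
    n_train = mnist.train.num_examples
    for epoch in range(n_epochs):
        shuffled_idx = np.random.permutation(n_train)
        for start in range(0, n_train, batch_size):
            batch_idx = shuffled_idx[start:start + batch_size]
            sess.run(phase2_training_op, feed_dict={hidden1: hidden1_cache[batch_idx]})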
import matplotlib.pyplot as plt

for i in range(5):
    plt.subplot(1, 5, i + 1)
    plot_image(weights1_val.T[i])   # plot_image and weights1_val are defined on earlier slides
plt.show()
More Techniques
• The first four features seem to correspond to small patches, while the fifth feature seems to look for vertical strokes (note that these features come from the stacked denoising autoencoder that we will discuss later).
• Another technique is to feed the autoencoder a random input image, measure the activation of the neuron you are interested in, and then perform backpropagation to tweak the image in such a way that the neuron will activate even more. If you iterate several times (performing gradient ascent), the image will gradually turn into the most exciting image (for the neuron). This is a useful technique to visualize the kinds of inputs that a neuron is looking for.
• Finally, if you are using an autoencoder to perform unsupervised pretraining (for example, for a classification task), a simple way to verify that the features learned by the autoencoder are useful is to measure the performance of the classifier.
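A rough sketch of the gradient-ascent visualization described above, applied to one neuron of the first hidden layer of the earlier graph (the neuron index, number of steps, and step size are arbitrary assumptions; in practice you would restore trained weights with a Saver rather than reinitializing):

import numpy as np

neuron_index = 0
neuron_activation = hidden1[0, neuron_index]               # activation of one neuron for a single image
image_gradient = tf.gradients(neuron_activation, X)[0]     # d(activation) / d(input image)

with tf.Session() as sess:
    init.run()                                             # or saver.restore(sess, checkpoint_path)
    image = np.random.rand(1, n_inputs)                    # start from a random input image
    for step in range(100):
        grad_val = sess.run(image_gradient, feed_dict={X: image})
        image += 0.1 * grad_val                            # gradient ascent step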
Overcomplete Autoencoder
• Up to now, in order to force the autoencoder to learn interesting features, we have limited the size of the coding layer, making it undercomplete. There are actually many other kinds of constraints that can be used, including ones that allow the coding layer to be just as large as the inputs, or even larger, resulting in an overcomplete autoencoder, such as the denoising autoencoder discussed next.
Denoising Autoencoders
• Another way to force the autoencoder to learn useful features is to
add noise to its inputs, training it to recover the original, noise-free
inputs. This prevents the autoencoder from trivially copying its inputs
to its outputs, so it ends up having to find patterns in the data.
• The idea of using autoencoders to remove noise has been around since the
1980s (e.g., it is mentioned in Yann LeCun’s 1987 master’s thesis). In a 2008
paper, Pascal Vincent et al. showed that autoencoders could also be used for
feature extraction. In a 2010 paper, Vincent et al. introduced stacked
denoising autoencoders.
• The noise can be pure Gaussian noise added to the inputs, or it can be
randomly switched off inputs, just like in dropout.
keep_prob = 0.7
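This keep_prob belongs to the dropout-based denoising autoencoder. A construction-phase sketch, assuming the stacked architecture and hyperparameters from before and the TF 1.x contrib dropout wrapper, might look like this:

from tensorflow.contrib.layers import dropout, fully_connected

is_training = tf.placeholder_with_default(False, shape=(), name="is_training")

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
X_drop = dropout(X, keep_prob, is_training=is_training)    # randomly switch off inputs, as in dropout

hidden1 = fully_connected(X_drop, n_hidden1)
hidden2 = fully_connected(hidden1, n_hidden2)              # codings
hidden3 = fully_connected(hidden2, n_hidden3)
outputs = fully_connected(hidden3, n_outputs, activation_fn=None)

# Note: we reconstruct the clean inputs X, not the corrupted X_drop
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
training_op = tf.train.AdamOptimizer(learning_rate).minimize(reconstruction_loss)

# Gaussian-noise variant (an alternative way to corrupt the inputs):
# X_noisy = X + tf.random_normal(tf.shape(X))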
Execution Phase
• During training we must set is_training to True (as explained in
Chapter 11) using the feed_dict:
sess.run(training_op, feed_dict={X: X_batch, is_training: True})
Sparse Autoencoders
• Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer.
• For example, it may be pushed to have on average only 5% significantly active neurons in the coding layer. This forces the autoencoder to represent each input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends up representing a useful feature (if you could speak only a few words per month, you would probably try to make them worth listening to).
• In order to favor sparse models, we must first measure the actual sparsity of the coding layer at each training iteration. We do so by computing the average activation of each neuron in the coding layer, over the whole training batch. The batch size must not be too small, or else the mean will not be accurate.
Sparsity Loss
• Once we have the mean activation per neuron, we want to penalize the neurons that are too active by adding a sparsity loss to the cost function.
• For example, if we measure that a neuron has an average activation of 0.3, but the target sparsity is 0.1, it must be penalized to activate less. One approach could be simply adding the squared error (0.3 − 0.1)² to the cost function, but in practice a better approach is to use the Kullback–Leibler divergence (briefly discussed in Chapter 4), which has much stronger gradients than the Mean Squared Error, as can be seen in the figure in the next slide.
Sparsity Loss
• In our case, we want to measure the divergence between the target probability p that a neuron in the coding layer will activate, and the actual probability q (i.e., the mean activation over the training batch). So the KL divergence simplifies to the following equation:
D_KL(p ∥ q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))
• Once we have computed the sparsity loss for each neuron in the coding layer, we
just sum up these losses, and add the result to the cost function. In order to
control the relative importance of the sparsity loss and the reconstruction loss,
we can multiply the sparsity loss by a sparsity weight hyperparameter. If this
weight is too high, the model will stick closely to the target sparsity, but it may
not reconstruct the inputs properly, making the model useless. Conversely, if it is
too low, the model will mostly ignore the sparsity objective and it will not learn
any interesting features.
TensorFlow Implementation
def kl_divergence(p, q):
    return p * tf.log(p / q) + (1 - p) * tf.log((1 - p) / (1 - q))

learning_rate = 0.01
sparsity_target = 0.1
sparsity_weight = 0.2

[...] # Build a normal autoencoder (in this example the coding layer is hidden1;
      # its activations must lie in (0, 1), e.g., via a sigmoid, for the KL divergence to be defined)

optimizer = tf.train.AdamOptimizer(learning_rate)

hidden1_mean = tf.reduce_mean(hidden1, axis=0)  # mean activation of each coding neuron over the batch
sparsity_loss = tf.reduce_sum(kl_divergence(sparsity_target, hidden1_mean))
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
loss = reconstruction_loss + sparsity_weight * sparsity_loss
training_op = optimizer.minimize(loss)
Variational Autoencoders
• Introduced in 2014 by Diederik Kingma and Max Welling
• Differences from other autoencoders
• They are probabilistic autoencoders, meaning that their outputs are partly
determined by chance, even after training (as opposed to denoising
autoencoders, which use randomness only during training).
• Most importantly, they are generative autoencoders, meaning that they can
generate new instances that look like they were sampled from the training set.
• These properties make them rather similar to RBMs, but they are
easier to train and the sampling process is much faster
• With RBMs you need to wait for the network to stabilize into a “thermal
equilibrium” before you can sample a new instance
More Explanation
• As can be seen in the figure, although the inputs may have a very convoluted distribution, a variational autoencoder tends to produce codings that look as though they were sampled from a simple Gaussian distribution: during training, the cost function (discussed next) pushes the codings to gradually migrate within the coding space (also called the latent space) to occupy a roughly (hyper)spherical region that looks like a cloud of Gaussian points.
• One great consequence is that after training a variational autoencoder, you can very easily generate a new instance: just sample a random coding from the Gaussian distribution, decode it, and voilà!
• Note that variational autoencoders are actually more general; the codings are not limited to Gaussian distributions.
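A rough construction-phase sketch of a variational autoencoder in the same TF 1.x style (the layer sizes, the cross-entropy reconstruction loss, and the learning rate are illustrative assumptions; the latent loss is the KL divergence between the codings' distribution and the standard Gaussian):

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

n_inputs = 28 * 28   # MNIST
n_hidden1 = 500
n_codings = 20

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden1 = fully_connected(X, n_hidden1)

# The encoder outputs a mean and a log-variance (gamma) for each coding dimension
codings_mean = fully_connected(hidden1, n_codings, activation_fn=None)
codings_gamma = fully_connected(hidden1, n_codings, activation_fn=None)
noise = tf.random_normal(tf.shape(codings_gamma), dtype=tf.float32)
codings = codings_mean + tf.exp(0.5 * codings_gamma) * noise   # sampled codings

# Decoder
hidden2 = fully_connected(codings, n_hidden1)
logits = fully_connected(hidden2, n_inputs, activation_fn=None)
outputs = tf.sigmoid(logits)

reconstruction_loss = tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits))
latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(codings_gamma) + tf.square(codings_mean) - 1 - codings_gamma)
cost = reconstruction_loss + latent_loss
training_op = tf.train.AdamOptimizer(0.001).minimize(cost)

# After training, new instances can be generated by sampling codings from N(0, I)
# and feeding them to the decoder:
#   codings_rnd = np.random.normal(size=[n_digits, n_codings])
#   outputs_val = outputs.eval(feed_dict={codings: codings_rnd})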
Other Autoencoders
• Contractive autoencoder (CAE) – The autoencoder is constrained during training so that the derivatives of the codings with regard to the inputs are small. In other words, two similar inputs must have similar codings.
• Stacked convolutional autoencoders – Autoencoders that learn to extract visual features by reconstructing images processed through convolutional layers.
• Generative stochastic network (GSN) – A generalization of denoising autoencoders, with the added capability to generate data.
• Winner-take-all (WTA) autoencoder – During training, after computing the activations of all the neurons in the coding layer, only the top k% activations for each neuron over the training batch are preserved, and the rest are set to zero. Naturally this leads to sparse codings. Moreover, a similar WTA approach can be used to produce sparse convolutional autoencoders.
• Adversarial autoencoders – One network is trained to reproduce its inputs, and at the same time another is trained to find inputs that the first network is unable to properly reconstruct. This pushes the first autoencoder to learn robust codings.