
Deep Learning
Autoencoders
Min-Te Sun, Ph.D.

Autoencoders
• Autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without any supervision (i.e., the training set is unlabeled).
• Applications of autoencoders
• the coding typically has a much lower dimensionality than the input data, making autoencoders useful for dimensionality reduction.
• autoencoders act as powerful feature detectors
• can be used for unsupervised pretraining of deep neural networks
• capable of randomly generating new data that looks very similar to the training data (i.e., a generative model)


How Does an Autoencoder Work?


• Autoencoders work by simply learning to copy their inputs to their
outputs – this may sound like a trivial task, but we will see that
constraining the network in various ways can make it rather difficult.
• limit the size of the internal representation
• add noise to the inputs and train the network to recover the original inputs
• These constraints force autoencoders to learn efficient ways of
representing the data.
• the coding is the byproduct of the autoencoder’s attempt to learn the identity
function under some constraints

An Example of Efficient Data Representations

• Which of the following number sequences do you find the easiest to memorize?
• 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
• 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20
• Shorter => easier to memorize?
• In fact, for the long sequence, even numbers are followed by their half, and odd numbers are followed by their triple plus one (see the short code sketch after this list)
• this is a famous sequence known as the hailstone sequence
• pattern => easier to memorize!
• Constraining an autoencoder during training pushes it to discover and exploit patterns in the data
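
• A minimal Python sketch of the rule behind the long sequence (illustrative only; not part of the original slides):

def hailstone(start, length=18):
    # Even numbers are followed by their half; odd numbers by triple plus one.
    sequence = [start]
    for _ in range(length - 1):
        n = sequence[-1]
        sequence.append(n // 2 if n % 2 == 0 else 3 * n + 1)
    return sequence

print(hailstone(50))  # [50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20]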


Relationship Between Memory, Perception, and Pattern Matching

• William Chase and Herbert Simon found in the early 1970s that expert chess players were able to memorize the positions of all the pieces in a game by looking at the board for just 5 seconds, a task that most people would find impossible.
• However, this was only the case when the pieces were placed in realistic positions (from actual games), not when the pieces were placed randomly.
• Chess experts don't have a much better memory than you and I, they just see chess patterns more easily thanks to their experience with the game. Noticing patterns helps them store information efficiently.
• Similar to chess players, an autoencoder looks at the inputs, converts them to an efficient internal representation, and then spits out something that looks very close to the inputs.
• An autoencoder is always composed of two parts: an encoder (or recognition network) that converts the inputs to an internal representation, followed by a decoder (or generative network) that converts the internal representation to the outputs.

The chess memory experiment (left) vs. a simple autoencoder (right)


Typical Autoencoders
• An autoencoder typically has the same architecture as a Multi-Layer
Perceptron, except that the number of neurons in the output layer must be
equal to the number of inputs.
• In this example, there is just one hidden layer composed of two neurons (the
encoder), and one output layer composed of three neurons (the decoder). The
outputs are often called the reconstructions since the autoencoder tries to
reconstruct the inputs, and the cost function contains a reconstruction loss that
penalizes the model when the reconstructions are different from the inputs.
• Because the internal representation has a lower dimensionality than the
input data (it is 2D instead of 3D), the autoencoder is said to be
undercomplete. An undercomplete autoencoder cannot trivially copy its
inputs to the codings, yet it must find a way to output a copy of its inputs.
It is forced to learn the most important features in the input data (and drop
the unimportant ones).

Performing PCA with an Undercomplete Linear Autoencoder

• If the autoencoder uses only linear activations and the cost function is the Mean Squared Error (MSE), then it can be shown that it ends up performing Principal Component Analysis (PCA).


Code for Undercomplete Linear Autoencoder

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

n_inputs = 3 # 3D inputs
n_hidden = 2 # 2D codings
n_outputs = n_inputs
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = fully_connected(X, n_hidden, activation_fn=None)
outputs = fully_connected(hidden, n_outputs, activation_fn=None)
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) # MSE
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)
init = tf.global_variables_initializer()

Code Explanation
• Primary differences between autoencoder and MLP
• The number of outputs is equal to the number of inputs.
• To perform simple PCA, we set activation_fn=None (i.e., all neurons are linear)
and the cost function is the MSE.


Code for Loading the Dataset, Training on the Training Set, and Using the Autoencoder

X_train, X_test = [...] # load the dataset
n_iterations = 1000
codings = hidden # the output of the hidden layer provides the codings

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        training_op.run(feed_dict={X: X_train}) # no labels (unsupervised)
    codings_val = codings.eval(feed_dict={X: X_test})

• The figure in the next slide shows the original 3D dataset (at the left) and the output of the autoencoder's hidden layer (i.e., the coding layer, at the right).
• The autoencoder found the best 2D plane to project the data onto, preserving as much variance in the data as it could (just like PCA).

PCA Performed by an Undercomplete Linear Autoencoder


Stacked Autoencoders
• Autoencoders can have multiple hidden layers. In this case they are called stacked autoencoders (or deep autoencoders).
• Adding more layers helps the autoencoder learn more complex codings.
• However, one must be careful not to make the autoencoder too powerful. Imagine an encoder so powerful that it just learns to map each input to a single arbitrary number (and the decoder learns the reverse mapping). Obviously such an autoencoder will reconstruct the training data perfectly, but it will not have learned any useful data representation in the process (and it is unlikely to generalize well to new instances).
• The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer). To put it simply, it looks like a sandwich.
• For example, an autoencoder for MNIST (introduced in Chapter 3) may have 784 inputs, followed by a hidden layer with 300 neurons, then a central hidden layer of 150 neurons, then another hidden layer with 300 neurons, and an output layer with 784 neurons.

Stacked Autoencoder for MNIST Dataset


Code for MNIST Autoencoder

• A stacked autoencoder can be implemented like a regular deep MLP.
• In particular, the same techniques we used in Chapter 11 for training deep nets can be applied.
• He initialization, ELU activation function, and l2 regularization.
• The code should look very familiar, except that there are no labels (no y):

n_inputs = 28 * 28 # for MNIST
n_hidden1 = 300
n_hidden2 = 150 # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.01
l2_reg = 0.001

Code for MNIST Autoencoder (Cont.)

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

with tf.contrib.framework.arg_scope(
        [fully_connected],
        activation_fn=tf.nn.elu,
        weights_initializer=tf.contrib.layers.variance_scaling_initializer(),
        weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg)):
    hidden1 = fully_connected(X, n_hidden1)
    hidden2 = fully_connected(hidden1, n_hidden2) # codings
    hidden3 = fully_connected(hidden2, n_hidden3)
    outputs = fully_connected(hidden3, n_outputs, activation_fn=None)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) # MSE
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([reconstruction_loss] + reg_losses)
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()


Execution Phase
n_epochs = 5
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})

Tying Weights
• When an autoencoder is neatly symmetrical, like the one we just built, a common technique is to tie the weights of the decoder layers to the weights of the encoder layers.
• This halves the number of weights in the model, speeding up training and limiting the risk of overfitting. Specifically, if the autoencoder has a total of N layers (not counting the input layer), and W_L represents the connection weights of the Lth layer (e.g., layer 1 is the first hidden layer, layer N/2 is the coding layer, and layer N is the output layer), then the decoder layer weights can be defined simply as: W_(N–L+1) = (W_L)^T (with L = 1, 2, ⋯, N/2).
• Unfortunately, implementing tied weights in TensorFlow using the fully_connected() function is a bit cumbersome; it's actually easier to just define the layers manually.


Code for Tying Weights

activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])
weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.transpose(weights2, name="weights3") # tied weights
weights4 = tf.transpose(weights1, name="weights4") # tied weights
biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")

Code for Tying Weights (Cont.)

hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
reg_loss = regularizer(weights1) + regularizer(weights2)
loss = reconstruction_loss + reg_loss
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()


Code Explanation
1. weights3 and weights4 are not variables; they are respectively the transposes of weights2 and weights1 (they are "tied" to them).
2. Since they are not variables, it's no use regularizing them: we only regularize weights1 and weights2.
3. Biases are never tied, and never regularized.

Training One Autoencoder at a Time

• Rather than training the whole stacked autoencoder in one go like we just did, it is often much faster to train one shallow autoencoder at a time, then stack all of them into a single stacked autoencoder (hence the name). This is especially useful for very deep autoencoders.
• During the first phase of training, the first autoencoder learns to reconstruct the inputs. During the second phase, the second autoencoder learns to reconstruct the output of the first autoencoder's hidden layer. Finally, you just build a big sandwich using all these autoencoders (i.e., you first stack the hidden layers of each autoencoder, then the output layers in reverse order). This gives you the final stacked autoencoder. You could easily train more autoencoders this way, building a very deep stacked autoencoder.


Training One Autoencoder at a Time (Using Multiple Graphs)
• To implement this multiphase training algorithm, the simplest
approach is to use a different TensorFlow graph for each phase. After
training an autoencoder, you just run the training set through it and
capture the output of the hidden layer. This output then serves as the
training set for the next autoencoder. Once all autoencoders have
been trained this way, you simply copy the weights and biases from
each autoencoder and use them to build the stacked autoencoder.
Implementing this approach is quite straightforward, so we won’t
detail it here, but please check out the code in the Jupyter notebooks
for an example.
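
• A minimal sketch of this multiphase approach (not the notebook's exact code; train_autoencoder() is a hypothetical helper that builds and trains one shallow autoencoder in its own tf.Graph() and returns its hidden-layer output for the data it was fed, plus its encoder/decoder weights and biases; X_train is assumed to hold the training images):

import tensorflow as tf

# Phase 1: train the first autoencoder on the raw inputs (hypothetical helper).
hidden_output1, W1, b1, W4, b4 = train_autoencoder(X_train, n_neurons=300)

# Phase 2: train the second autoencoder on the first one's codings.
_, W2, b2, W3, b3 = train_autoencoder(hidden_output1, n_neurons=150)

# Stack the captured parameters into the final stacked autoencoder
# (encoder layers first, then the decoder layers in reverse order).
X = tf.placeholder(tf.float32, shape=[None, 28 * 28])
hidden1 = tf.nn.elu(tf.matmul(X, W1) + b1)
hidden2 = tf.nn.elu(tf.matmul(hidden1, W2) + b2)   # codings
hidden3 = tf.nn.elu(tf.matmul(hidden2, W3) + b3)
outputs = tf.matmul(hidden3, W4) + b4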

Training One Autoencoder At a Time


A Single Graph to Train a Stacked Autoencoder

Training a Stacked Autoencoder with a Single Graph
• Another approach is to use a single graph containing the whole stacked autoencoder, plus some extra operations to perform each training phase.
• The central column in the graph is the full stacked autoencoder. (This part can be used after training.)
• The left column is the set of operations needed to run the first phase of training. It creates an output layer that bypasses hidden layers 2 and 3. This output layer shares the same weights and biases as the stacked autoencoder's output layer. On top of that are the training operations that will aim at making the output as close as possible to the inputs. Thus, this phase will train the weights and biases for hidden layer 1 and the output layer (i.e., the first autoencoder).
• The right column in the graph is the set of operations needed to run the second phase of training. It adds the training operation that will aim at making the output of hidden layer 3 as close as possible to the output of hidden layer 1. Note that we must freeze hidden layer 1 while running phase 2. This phase will train the weights and biases for hidden layers 2 and 3 (i.e., the second autoencoder).


Code for Phase 1 and Phase 2

[...] # Build the whole stacked autoencoder normally.
# In this example, the weights are not tied.
optimizer = tf.train.AdamOptimizer(learning_rate)

with tf.name_scope("phase1"):
    phase1_outputs = tf.matmul(hidden1, weights4) + biases4
    phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
    phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
    phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
    phase1_training_op = optimizer.minimize(phase1_loss)

with tf.name_scope("phase2"):
    phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
    phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
    phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
    train_vars = [weights2, biases2, weights3, biases3]
    phase2_training_op = optimizer.minimize(phase2_loss, var_list=train_vars)

Code Explanation
• The first phase is rather straightforward: we just create an output layer that skips hidden layers 2 and 3, then build the training operations to minimize the distance between the outputs and the inputs (plus some regularization).
• The second phase just adds the operations needed to minimize the distance between the output of hidden layer 3 and hidden layer 1 (also with some regularization). Most importantly, we provide the list of trainable variables to the minimize() method, making sure to leave out weights1 and biases1; this effectively freezes hidden layer 1 during phase 2.


Execution Phase
• During the execution phase, all you need to do is run the phase 1 training op for a number of epochs, then the phase 2 training op for some more epochs.
• Since hidden layer 1 is frozen during phase 2, its output will always be the same for any given training instance. To avoid having to recompute the output of hidden layer 1 at every single epoch, you can compute it for the whole training set at the end of phase 1, then directly feed the cached output of hidden layer 1 during phase 2. This can give you a nice performance boost.
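
• A minimal sketch of this caching trick (it reuses the names from the single-graph code above: X, hidden1, phase1_training_op, phase2_training_op, init, plus mnist, n_epochs, and batch_size from the earlier slides):

import numpy as np

with tf.Session() as sess:
    init.run()
    # Phase 1: train hidden layer 1 and the output layer on the raw inputs.
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(phase1_training_op, feed_dict={X: X_batch})
    # Hidden layer 1 is frozen from now on, so compute its output once for the
    # whole training set instead of recomputing it at every epoch.
    hidden1_cache = hidden1.eval(feed_dict={X: mnist.train.images})
    # Phase 2: feed the cached codings directly into the hidden1 tensor.
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            indices = np.random.permutation(mnist.train.num_examples)[:batch_size]
            sess.run(phase2_training_op, feed_dict={hidden1: hidden1_cache[indices]})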

Visualizing the Reconstructions

• One way to ensure that an autoencoder is properly trained is to compare the inputs and the outputs. They must be fairly similar, and the differences should be unimportant details.

import matplotlib.pyplot as plt

n_test_digits = 2
X_test = mnist.test.images[:n_test_digits]

with tf.Session() as sess:
    [...] # Train the autoencoder
    outputs_val = outputs.eval(feed_dict={X: X_test})

def plot_image(image, shape=[28, 28]):
    plt.imshow(image.reshape(shape), cmap="Greys", interpolation="nearest")
    plt.axis("off")

for digit_index in range(n_test_digits):
    plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
    plot_image(X_test[digit_index])
    plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
    plot_image(outputs_val[digit_index])


Original Digits (left) and Their Reconstructions (right)

Has Our Autoencoder Learned Useful Features?

• Once your autoencoder has learned some features, you may want to take a look at them.
• The simplest technique is to consider each neuron in every hidden layer, and find the training instances that activate it the most. This is especially useful for the top hidden layers since they often capture relatively large features that you can easily spot in a group of training instances that contain them.
• For example, if a neuron strongly activates when it sees a cat in a picture, it will be pretty obvious that the pictures that activate it the most all contain cats. However, for lower layers, this technique does not work so well, as the features are smaller and more abstract, so it's often hard to understand exactly what the neuron is getting all excited about.
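
• As a rough illustration of this technique (not from the original slides; it assumes the trained MNIST autoencoder from the earlier slides, with hidden2 as the coding layer, and reuses mnist, X, init, and plot_image), the code below finds the training images that most activate one coding-layer neuron:

import numpy as np

neuron_index = 0   # which coding-layer neuron to inspect
top_k = 10         # how many of its strongest training instances to display

with tf.Session() as sess:
    init.run()
    # [...] train the autoencoder here, as in the earlier slides
    codings_val = hidden2.eval(feed_dict={X: mnist.train.images})

# Indices of the training instances with the highest activation for that neuron.
top_instances = np.argsort(codings_val[:, neuron_index])[-top_k:][::-1]
for rank, index in enumerate(top_instances):
    plt.subplot(1, top_k, rank + 1)
    plot_image(mnist.train.images[index])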


Alternative Technique for Visualizing Features

• Another technique is that for each neuron in the first hidden layer, you can create an image where a pixel's intensity corresponds to the weight of the connection to the given neuron.
• The following code plots the features learned by five neurons in the first hidden layer:

with tf.Session() as sess:
    [...] # train autoencoder
    weights1_val = weights1.eval()

for i in range(5):
    plt.subplot(1, 5, i + 1)
    plot_image(weights1_val.T[i])

Features Learned by Five Neurons from the First Hidden Layer


More Techniques
• The first four features seem to correspond to small patches, while the fifth feature seems to look for vertical strokes (note that these features come from the stacked denoising autoencoder that we will discuss later).
• Another technique is to feed the autoencoder a random input image, measure the activation of the neuron you are interested in, and then perform backpropagation to tweak the image in such a way that the neuron will activate even more. If you iterate several times (performing gradient ascent), the image will gradually turn into the most exciting image (for the neuron). This is a useful technique to visualize the kinds of inputs that a neuron is looking for (see the sketch after this list).
• Finally, if you are using an autoencoder to perform unsupervised pretraining—for example, for a classification task—a simple way to verify that the features learned by the autoencoder are useful is to measure the performance of the classifier.
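
• A minimal sketch of the gradient-ascent idea above (illustrative only; it assumes the trained network from the earlier slides, inspects one coding-layer neuron of hidden2, and uses an arbitrary step size and iteration count):

import numpy as np

neuron_index = 0
neuron_activation = hidden2[0, neuron_index]        # activation for a single input image
gradient = tf.gradients(neuron_activation, X)[0]    # d(activation) / d(input pixels)

with tf.Session() as sess:
    init.run()
    # [...] train the autoencoder here, as in the earlier slides
    image = np.random.rand(1, 28 * 28).astype(np.float32)  # start from a random image
    for step in range(100):                                 # gradient ascent on the input
        grad_val = gradient.eval(feed_dict={X: image})
        image += 0.1 * grad_val
plot_image(image)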

Unsupervised Pretraining Using Stacked Autoencoders

• As discussed in Chapter 11, if you are tackling a complex supervised task but you do not have a lot of labeled training data, one solution is to find a neural network that performs a similar task, and then reuse its lower layers. This makes it possible to train a high-performance model using only little training data because your neural network won't have to learn all the low-level features; it will just reuse the feature detectors learned by the existing net.
• Similarly, if you have a large dataset but most of it is unlabeled, you can first train a stacked autoencoder using all the data, then reuse the lower layers to create a neural network for your actual task, and train it using the labeled data.
• For example, the figure in the next slide shows how to use a stacked autoencoder to perform unsupervised pretraining for a classification neural network. The stacked autoencoder itself is typically trained one autoencoder at a time, as discussed earlier. When training the classifier, if you really don't have much labeled training data, you may want to freeze the pretrained layers (at least the lower ones).


Unsupervised Pretraining Using Autoencoders

The Importance of Unsupervised Learning in Deep Learning

• This situation is actually quite common, because building a large unlabeled dataset is often cheap (e.g., a simple script can download millions of images off the internet), but labeling them can only be done reliably by humans (e.g., classifying images as cute or not). Labeling instances is time-consuming and costly, so it is quite common to have only a few thousand labeled instances.
• As we discussed earlier, one of the triggers of the current Deep Learning tsunami is the discovery in 2006 by Geoffrey Hinton et al. that deep neural networks can be pretrained in an unsupervised fashion. They used restricted Boltzmann machines for that (see Appendix E), but in 2007 Yoshua Bengio et al. showed that autoencoders worked just as well.
• There is nothing special about the TensorFlow implementation: just train an autoencoder using all the training data, then reuse its encoder layers to create a new neural network (see Chapter 11 for more details on how to reuse pretrained layers, or check out the code examples in the Jupyter notebooks).
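
• A minimal sketch of reusing the pretrained encoder (an assumption-laden illustration, not the notebooks' exact code; it reuses X, weights1/biases1, weights2/biases2, learning_rate, and the fully_connected import from the tied-weights slides, adds a new 10-class output layer, and freezes the pretrained layers by optimizing only the new layer's variables):

n_classes = 10

y = tf.placeholder(tf.int32, shape=[None])
hidden1 = tf.nn.elu(tf.matmul(X, weights1) + biases1)        # reused encoder layer 1
hidden2 = tf.nn.elu(tf.matmul(hidden1, weights2) + biases2)  # reused encoder layer 2 (codings)
logits = fully_connected(hidden2, n_classes, activation_fn=None, scope="new_softmax")

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
# Only the new output layer's variables are trained; the encoder stays frozen.
new_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="new_softmax")
training_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=new_vars)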


Overcomplete Autoencoder
• Up to now, in order to force the autoencoder to learn interesting features, we have limited the size of the coding layer, making it undercomplete. There are actually many other kinds of constraints that can be used, including ones that allow the coding layer to be just as large as the inputs, or even larger, resulting in an overcomplete autoencoder => Denoising Autoencoders

Denoising Autoencoders
• Another way to force the autoencoder to learn useful features is to
add noise to its inputs, training it to recover the original, noise-free
inputs. This prevents the autoencoder from trivially copying its inputs
to its outputs, so it ends up having to find patterns in the data.
• The idea of using autoencoders to remove noise has been around since the
1980s (e.g., it is mentioned in Yann LeCun’s 1987 master’s thesis). In a 2008
paper, Pascal Vincent et al. showed that autoencoders could also be used for
feature extraction. In a 2010 paper, Vincent et al. introduced stacked
denoising autoencoders.
• The noise can be pure Gaussian noise added to the inputs, or it can be
randomly switched off inputs, just like in dropout.


Denoising Autoencoders, with Gaussian Noise (left) or Dropout (right)

TensorFlow Implementation (Gaussian Noise)

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
X_noisy = X + tf.random_normal(tf.shape(X))
[...]
hidden1 = activation(tf.matmul(X_noisy, weights1) + biases1)
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) # MSE
[...]


TensorFlow Implementation (Dropout)

from tensorflow.contrib.layers import dropout

keep_prob = 0.7

is_training = tf.placeholder_with_default(False, shape=(), name='is_training')

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
X_drop = dropout(X, keep_prob, is_training=is_training)
[...]
hidden1 = activation(tf.matmul(X_drop, weights1) + biases1)
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) # MSE
[...]

Execution Phase
• During training we must set is_training to True (as explained in Chapter 11) using the feed_dict:

sess.run(training_op, feed_dict={X: X_batch, is_training: True})

• However, during testing it is not necessary to set is_training to False, since we set that as the default in the call to the placeholder_with_default() function.


Sparse Autoencoders
• Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer.
• For example, it may be pushed to have on average only 5% significantly active neurons in the coding layer. This forces the autoencoder to represent each input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends up representing a useful feature (if you could speak only a few words per month, you would probably try to make them worth listening to).
• In order to favor sparse models, we must first measure the actual sparsity of the coding layer at each training iteration. We do so by computing the average activation of each neuron in the coding layer, over the whole training batch. The batch size must not be too small, or else the mean will not be accurate.

Sparsity Loss
• Once we have the mean activation per neuron, we want to penalize the neurons that are too active by adding a sparsity loss to the cost function.
• For example, if we measure that a neuron has an average activation of 0.3, but the target sparsity is 0.1, it must be penalized to activate less. One approach could be simply adding the squared error (0.3 – 0.1)² to the cost function, but in practice a better approach is to use the Kullback–Leibler divergence (briefly discussed in Chapter 4), which has much stronger gradients than the Mean Squared Error, as can be seen in the figure in the next slide.


Sparsity Loss

• Given two discrete probability distributions P and Q, the KL divergence between these distributions, noted D_KL(P ∥ Q), can be computed using the following equation:

D_KL(P ∥ Q) = Σ_i P(i) log( P(i) / Q(i) )

• In our case, we want to measure the divergence between the target probability p that a neuron in the coding layer will activate, and the actual probability q (i.e., the mean activation over the training batch). So the KL divergence simplifies to the following equation:

D_KL(p ∥ q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

• Once we have computed the sparsity loss for each neuron in the coding layer, we just sum up these losses, and add the result to the cost function. In order to control the relative importance of the sparsity loss and the reconstruction loss, we can multiply the sparsity loss by a sparsity weight hyperparameter. If this weight is too high, the model will stick closely to the target sparsity, but it may not reconstruct the inputs properly, making the model useless. Conversely, if it is too low, the model will mostly ignore the sparsity objective and it will not learn any interesting features.


TensorFlow Implementation

def kl_divergence(p, q):
    return p * tf.log(p / q) + (1 - p) * tf.log((1 - p) / (1 - q))

learning_rate = 0.01
sparsity_target = 0.1
sparsity_weight = 0.2
[...] # Build a normal autoencoder (in this example the coding layer is hidden1)
optimizer = tf.train.AdamOptimizer(learning_rate)
hidden1_mean = tf.reduce_mean(hidden1, axis=0) # batch mean
sparsity_loss = tf.reduce_sum(kl_divergence(sparsity_target, hidden1_mean))
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) # MSE
loss = reconstruction_loss + sparsity_weight * sparsity_loss
training_op = optimizer.minimize(loss)

Comments for Code

• An important detail is the fact that the activations of the coding layer must be between 0 and 1 (but not equal to 0 or 1), or else the KL divergence will return NaN (Not a Number). A simple solution is to use the logistic activation function for the coding layer:

hidden1 = tf.nn.sigmoid(tf.matmul(X, weights1) + biases1)

• One simple trick can speed up convergence: instead of using the MSE, we can choose a reconstruction loss that will have larger gradients. Cross entropy is often a good choice. To use it, we must normalize the inputs to make them take on values from 0 to 1, and use the logistic activation function in the output layer so the outputs also take on values from 0 to 1. TensorFlow's sigmoid_cross_entropy_with_logits() function takes care of efficiently applying the logistic (sigmoid) activation function to the outputs and computing the cross entropy:

[...]
logits = tf.matmul(hidden1, weights2) + biases2
outputs = tf.nn.sigmoid(logits)
reconstruction_loss = tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits))

• Note that the outputs operation is not needed during training (we use it only when we want to look at the reconstructions).


Variational Autoencoders
• Introduced in 2014 by Diederik Kingma and Max Welling
• Differences from other autoencoders
• They are probabilistic autoencoders, meaning that their outputs are partly
determined by chance, even after training (as opposed to denoising
autoencoders, which use randomness only during training).
• Most importantly, they are generative autoencoders, meaning that they can
generate new instances that look like they were sampled from the training set.
• These properties make them rather similar to RBMs, but they are
easier to train and the sampling process is much faster
• With RBMs you need to wait for the network to stabilize into a “thermal
equilibrium” before you can sample a new instance

How Variational Autoencoders Work

• The figure in the next slide shows the variational autoencoder which, just like the basic structure of all autoencoders, has an encoder followed by a decoder (in this example, they both have two hidden layers). However…
• Instead of directly producing a coding for a given input, the encoder produces a mean coding μ and a standard deviation σ. The actual coding is then sampled randomly from a Gaussian distribution with mean μ and standard deviation σ.
• After that the decoder just decodes the sampled coding normally. The right part of the diagram shows a training instance going through this autoencoder. First, the encoder produces μ and σ, then a coding is sampled randomly (notice that it is not exactly located at μ), and finally this coding is decoded, and the final output resembles the training instance.


Variational Autoencoder (left), and an Instance Going Through It (right)

More Explanation
• As can be seen in the figure, although the inputs may have a very convoluted distribution, a variational autoencoder tends to produce codings that look as though they were sampled from a simple Gaussian distribution: during training, the cost function (discussed next) pushes the codings to gradually migrate within the coding space (also called the latent space) to occupy a roughly (hyper)spherical region that looks like a cloud of Gaussian points.
• One great consequence is that after training a variational autoencoder, you can very easily generate a new instance: just sample a random coding from the Gaussian distribution, decode it, and voilà!
• Note that variational autoencoders are actually more general; the codings are not limited to Gaussian distributions.


The Cost Function

• The cost function is composed of two parts.
• The first is the usual reconstruction loss that pushes the autoencoder to reproduce its inputs (we can use cross entropy for this, as discussed earlier).
• The second is the latent loss that pushes the autoencoder to have codings that look as though they were sampled from a simple Gaussian distribution, for which we use the KL divergence between the target distribution (the Gaussian distribution) and the actual distribution of the codings.

Code for Latent Loss

• The math is a bit more complex than earlier, in particular because of the Gaussian noise, which limits the amount of information that can be transmitted to the coding layer (thus pushing the autoencoder to learn useful features). Luckily, the equations simplify to the following code for the latent loss:

eps = 1e-10 # smoothing term to avoid computing log(0), which is NaN
latent_loss = 0.5 * tf.reduce_sum(
    tf.square(hidden3_sigma) + tf.square(hidden3_mean)
    - 1 - tf.log(eps + tf.square(hidden3_sigma)))

• One common variant is to train the encoder to output γ = log(σ²) rather than σ. Wherever we need σ we can just compute σ = exp(γ/2). This makes it a bit easier for the encoder to capture sigmas of different scales, and thus it helps speed up convergence. The latent loss ends up a bit simpler:

latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)


Code for Variational Autoencoder Using the log(σ²) Variant

n_inputs = 28 * 28 # for MNIST
n_hidden1 = 500
n_hidden2 = 500
n_hidden3 = 20 # codings
n_hidden4 = n_hidden2
n_hidden5 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.001

with tf.contrib.framework.arg_scope(
        [fully_connected],
        activation_fn=tf.nn.elu,
        weights_initializer=tf.contrib.layers.variance_scaling_initializer()):
    X = tf.placeholder(tf.float32, [None, n_inputs])

Code for Variational Autoencoder Using the log(σ²) Variant (Cont.)

    # (continuing inside the arg_scope block from the previous slide)
    hidden1 = fully_connected(X, n_hidden1)
    hidden2 = fully_connected(hidden1, n_hidden2)
    hidden3_mean = fully_connected(hidden2, n_hidden3, activation_fn=None)
    hidden3_gamma = fully_connected(hidden2, n_hidden3, activation_fn=None)
    hidden3_sigma = tf.exp(0.5 * hidden3_gamma)
    noise = tf.random_normal(tf.shape(hidden3_sigma), dtype=tf.float32)
    hidden3 = hidden3_mean + hidden3_sigma * noise
    hidden4 = fully_connected(hidden3, n_hidden4)
    hidden5 = fully_connected(hidden4, n_hidden5)
    logits = fully_connected(hidden5, n_outputs, activation_fn=None)
    outputs = tf.sigmoid(logits)

reconstruction_loss = tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits))
latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)
cost = reconstruction_loss + latent_loss
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(cost)
init = tf.global_variables_initializer()


Use This Variational Autoencoder to Generate Digits

import numpy as np

n_digits = 60
n_epochs = 50
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})
    codings_rnd = np.random.normal(size=[n_digits, n_hidden3])
    outputs_val = outputs.eval(feed_dict={hidden3: codings_rnd})

• That's it. The output of the following code is shown in the next slide.

for iteration in range(n_digits):
    plt.subplot(n_digits, 10, iteration + 1)
    plot_image(outputs_val[iteration])

Images of Handwritten Digits Generated by the Variational Autoencoder (Only 30 Min of Training)


Other Autoencoders
• Contractive autoencoder (CAE) – The autoencoder is constrained during training so that the derivatives of the codings with regard to the inputs are small. In other words, two similar inputs must have similar codings.
• Stacked convolutional autoencoders – Autoencoders that learn to extract visual features by reconstructing images processed through convolutional layers.
• Generative stochastic network (GSN) – A generalization of denoising autoencoders, with the added capability to generate data.
• Winner-take-all (WTA) autoencoder – During training, after computing the activations of all the neurons in the coding layer, only the top k% activations for each neuron over the training batch are preserved, and the rest are set to zero. Naturally this leads to sparse codings. Moreover, a similar WTA approach can be used to produce sparse convolutional autoencoders. (A short sketch of this idea follows the list.)
• Adversarial autoencoders – One network is trained to reproduce its inputs, and at the same time another is trained to find inputs that the first network is unable to properly reconstruct. This pushes the first autoencoder to learn robust codings.
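
• As a rough NumPy illustration of the winner-take-all idea (an assumption-laden sketch, not taken from any of the papers above), the function below keeps, for each coding-layer neuron, only its top k% activations over a training batch and zeroes out the rest:

import numpy as np

def wta_sparsify(codings, k_percent=5.0):
    # codings: array of shape [batch_size, n_coding_neurons]
    batch_size, n_neurons = codings.shape
    n_keep = max(1, int(np.ceil(batch_size * k_percent / 100.0)))
    sparse = np.zeros_like(codings)
    for j in range(n_neurons):
        top = np.argsort(codings[:, j])[-n_keep:]   # strongest activations for neuron j
        sparse[top, j] = codings[top, j]
    return sparse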
