dimensionality low. The costly work of collecting more labelled data and the tedious work of doing more clever engineering can go a long way in solving particular problems, but this is ultimately unsatisfying as a machine learning solution. This highlights the need for using unsupervised learning to find and represent structure in videos. Moreover, videos have a lot of structure in them (spatial and temporal regularities) which makes them particularly well suited as a domain for building unsupervised learning models.

1.2. Our Approach

When designing any unsupervised learning model, it is crucial to have the right inductive biases and choose the right objective function so that the learning signal points the model towards learning useful features. In this paper, we use the LSTM Encoder-Decoder framework to learn video representations. The key inductive bias here is that the same operation must be applied at each time step to propagate information to the next step. This enforces the fact that the physics of the world remains the same, irrespective of input: the same physics acting on any state, at any time, must produce the next state. Our model works as follows. The Encoder LSTM runs through a sequence of frames to come up with a representation. This representation is then decoded through another LSTM to produce a target sequence. We consider different choices of the target sequence. One choice is to predict the same sequence as the input. The motivation is similar to that of autoencoders – we wish to capture all that is needed to reproduce the input but at the same time go through the inductive biases imposed by the model. Another option is to predict the future frames. Here the motivation is to learn a representation that extracts all that is needed to extrapolate the motion and appearance beyond what has been observed. These two natural choices can also be combined. In this case, there are two decoder LSTMs – one that decodes the representation into the input sequence and another that decodes the same representation to predict the future.

The inputs to the model can, in principle, be any representation of individual video frames. However, for the purposes of this work, we limit our attention to two kinds of inputs. The first is image patches. For this we use natural image patches as well as a dataset of moving MNIST digits. The second is high-level "percepts" extracted by applying a convolutional net trained on ImageNet. These percepts are the states of the last (and/or second-to-last) layers of rectified linear hidden units from a convolutional neural net model.

In order to evaluate the learned representations we qualitatively analyze the reconstructions and predictions made by the model. For a more quantitative evaluation, we use these LSTMs as initializations for the supervised task of action recognition. If the unsupervised learning model comes up with useful representations then the classifier should be able to perform better, especially when there are only a few labelled examples. We find that this is indeed the case.

1.3. Related Work

The first approaches to learning representations of videos in an unsupervised way were based on ICA (van Hateren & Ruderman, 1998; Hurri & Hyvärinen, 2003). Le et al. (2011) approached this problem using multiple layers of Independent Subspace Analysis modules. Generative models for understanding transformations between pairs of consecutive images are also well studied (Memisevic, 2013; Memisevic & Hinton, 2010; Susskind et al., 2011). This work was extended recently by Michalski et al. (2014) to model longer sequences.

Recently, Ranzato et al. (2014) proposed a generative model for videos. The model uses a recurrent neural network to predict the next frame or interpolate between frames. In this work, the authors highlight the importance of choosing the right loss function. It is argued that squared loss in input space is not the right objective because it does not respond well to small distortions in input space. The proposed solution is to quantize image patches into a large dictionary and train the model to predict the identity of the target patch. This does solve some of the problems of squared loss but it introduces an arbitrary dictionary size into the picture and altogether removes the idea of patches being similar or dissimilar to one another. Designing an appropriate loss function that respects our notion of visual similarity is a very hard problem (in a sense, almost as hard as the modeling problem we want to solve in the first place). Therefore, in this paper, we use the simple squared loss objective function as a starting point and focus on designing an encoder-decoder RNN architecture that can be used with any loss function.

2. Model Description

In this section, we describe several variants of our LSTM Encoder-Decoder model. The basic unit of our network is the LSTM cell block. Our implementation of LSTMs follows closely the one discussed by Graves (2013).

2.1. Long Short Term Memory

In this section we briefly describe the LSTM unit which is the basic building block of our model. The unit is shown in Fig. 1 (reproduced from Graves (2013)).

Each LSTM unit has a cell which has a state c_t at time t. This cell can be thought of as a memory unit. Access to this memory unit for reading or modifying it is controlled through sigmoidal gates – the input gate i_t, forget gate f_t and output gate o_t.
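Concretely, with peephole connections as in Graves (2013), a standard form of the per-timestep updates is as follows (the weight-matrix notation below is the conventional one and is shown only to make the gating explicit, not copied from the paper):

```latex
% Standard LSTM updates with peephole connections (after Graves, 2013).
% x_t is the input at time t, h_t the unit output, c_t the cell state;
% W and b are learned weights and biases, \sigma is the logistic sigmoid,
% and \odot denotes elementwise multiplication.
\begin{align*}
i_t &= \sigma\!\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i\right) \\
f_t &= \sigma\!\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
o_t &= \sigma\!\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}
```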
2.2. LSTM Autoencoder Model

[...] learn trivial mappings for arbitrary length input sequences. Second, the same LSTM operation is used to decode the representation recursively. This means that the same dynamics must be applied on the representation at any stage of decoding. This further prevents the model from learning an identity mapping.

2.3. LSTM Future Predictor Model

Another natural unsupervised learning task for sequences is predicting the future. This is the approach used in language models for modeling sequences of words. The design of the Future Predictor Model is the same as that of the Autoencoder Model, except that the decoder LSTM in this case predicts frames of the video that come after the input sequence (Fig. 3). Ranzato et al. (2014) use a similar model but predict only the next frame at each time step. This model, on the other hand, predicts a long sequence into the future. Here again we can consider two variants of the decoder – conditional and unconditioned.

Why should this learn good features?
In order to predict the next few frames correctly, the model needs information about which objects and background are present and how they are moving so that the motion can be extrapolated. The hidden state coming out from the encoder will try to capture this information. Therefore, this state can be seen as a representation of the input sequence.

2.4. Conditional Decoder

For each of these two models, we can consider two possibilities – one in which the decoder LSTM is conditioned on the last generated frame and the other in which it is not. In the experimental section, we explore these choices quantitatively. Here we briefly discuss arguments for and against a conditional decoder. A strong argument in favour of using a conditional decoder is that it allows the decoder to model multiple modes in the target sequence distribution. Without that, we would end up averaging the multiple modes in the low-level input space. However, this is an issue only if we expect multiple modes in the target sequence distribution. For the LSTM Autoencoder, there is only one correct target and hence a unimodal target distribution. But for the LSTM Future Predictor there is a possibility of multiple targets given an input because even if we assume a deterministic universe, everything needed to predict the future will not necessarily be observed in the input.

There is also an argument against using a conditional decoder from the optimization point of view. There are strong short-range correlations in video data; for example, most of the content of a frame is the same as the previous one. If the decoder were given access to the last few frames while generating a particular frame at training time, it would find it easy to pick up on these correlations. There would only be a very small gradient that tries to fix up the extremely subtle errors that require long term knowledge about the input sequence. In an unconditioned decoder, this input is removed and the model is forced to look for information deep inside the encoder.

Figure 4. The Composite Model: The LSTM predicts the future as well as the input sequence.

2.5. A Composite Model

The two tasks – reconstructing the input and predicting the future – can be combined to create a composite model as shown in Fig. 4. Here the encoder LSTM is asked to come up with a state from which we can both predict the next few frames as well as reconstruct the input.
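As a concrete illustration of this architecture, below is a minimal sketch of a composite encoder-decoder. It is not the authors' implementation: the framework (PyTorch), the single-layer LSTMCell decoders, and the zero input fed to the unconditioned decoders are assumptions made for readability.

```python
# Minimal sketch of a Composite Model: one encoder LSTM whose final state is
# decoded by two LSTMs, one reconstructing the input and one predicting the
# future. Illustrative only; sizes, framework and details are assumptions.
import torch
import torch.nn as nn

class CompositeLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.encoder = nn.LSTMCell(input_dim, hidden_dim)
        # Unconditioned decoders: they receive a zero input at every step and
        # must unroll the representation using only their recurrent dynamics.
        self.dec_reconstruct = nn.LSTMCell(input_dim, hidden_dim)
        self.dec_predict = nn.LSTMCell(input_dim, hidden_dim)
        self.readout = nn.Linear(hidden_dim, input_dim)

    def forward(self, frames, n_future):
        # frames: (seq_len, batch, input_dim), e.g. flattened image patches.
        seq_len, batch, dim = frames.shape
        h = frames.new_zeros(batch, self.hidden_dim)
        c = frames.new_zeros(batch, self.hidden_dim)
        for x in frames:                      # run the encoder over the input
            h, c = self.encoder(x, (h, c))
        zero = frames.new_zeros(batch, dim)   # dummy input for the decoders

        def unroll(cell, steps):
            hd, cd, out = h, c, []            # both decoders start from the
            for _ in range(steps):            # encoder's final state
                hd, cd = cell(zero, (hd, cd))
                out.append(self.readout(hd))
            return torch.stack(out)           # (steps, batch, input_dim)

        return unroll(self.dec_reconstruct, seq_len), unroll(self.dec_predict, n_future)
```

A conditional variant would feed the previously generated frame to the decoder cells in place of the zero input.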
This composite model tries to overcome the shortcomings that each model suffers on its own. A high-capacity autoencoder would suffer from the tendency to learn trivial representations that just memorize the inputs. However, this memorization is not useful at all for predicting the future. Therefore, the composite model cannot just memorize information. On the other hand, the future predictor suffers from the tendency to store information only about the last few frames since those are most important for predicting the future, i.e., in order to predict v_t, the frames {v_{t-1}, ..., v_{t-k}} are much more important than v_0, for some small value of k. Therefore the representation at the end of the encoder will have forgotten about a large part of the input. But if we ask the model to also predict all of the input sequence, then it cannot just pay attention to the last few frames.

3. Experiments

We design experiments to accomplish the following objectives:

• Get a qualitative understanding of what the LSTM learns to do.

• Measure the benefit of initializing networks for supervised learning tasks with the weights found by unsupervised learning, especially with very few training examples.

• Compare the different proposed models – Autoencoder, Future Predictor and Composite models and their conditional variants.

• Compare with state-of-the-art action recognition benchmarks.

3.1. Datasets

We use the UCF-101 and HMDB-51 datasets for supervised tasks. The UCF-101 dataset (Soomro et al., 2012) contains 13,320 videos with an average length of 6.2 seconds belonging to 101 different action categories. The dataset has 3 standard train/test splits with the training set containing around 9,500 videos in each split (the rest are test). The HMDB-51 dataset (Kuehne et al., 2011) contains 5,100 videos belonging to 51 different action categories. Mean length of the videos is 3.2 seconds. This also has 3 train/test splits with 3,570 videos in the training set and the rest in test.

To train the unsupervised models, we used a subset of the Sports-1M dataset (Karpathy et al., 2014), which contains 1 million YouTube clips. Even though this dataset is labelled for actions, we did not do any supervised experiments on it because of logistical constraints with working with such a huge dataset. We instead collected 300 hours of video by randomly sampling 10 second clips from the dataset. It is possible to collect better samples if, instead of choosing randomly, we extracted videos where a lot of motion is happening and where there are no shot boundaries. However, we did not do so in the spirit of unsupervised learning, and because we did not want to introduce any unnatural bias in the samples. We also used the supervised datasets (UCF-101 and HMDB-51) for unsupervised training. However, we found that using them did not give any significant advantage over just using the YouTube videos.

We extracted percepts using the convolutional neural net model of Simonyan & Zisserman (2014b). The videos have a resolution of 240 × 320 and were sampled at almost 30 frames per second. We took the central 224 × 224 patch from each frame and ran it through the convnet. This gave us the RGB percepts. Additionally, for UCF-101, we computed flow percepts by extracting flows using the Brox method and training the temporal stream convolutional network as described by Simonyan & Zisserman (2014a). We found that the fc6 features worked better than fc7 for single frame classification using both RGB and flow percepts. Therefore, we used the 4096-dimensional fc6 layer as the input representation of our data. Besides these percepts, we also trained the proposed models on 32 × 32 patches of pixels.

All models were trained using backprop on a single NVIDIA Titan GPU. A two layer 2048 unit Composite model that predicts 13 frames and reconstructs 16 frames took 18-20 hours to converge on 300 hours of percepts. We initialized weights by sampling from a uniform distribution whose scale was set to 1/sqrt(fan-in). Biases at all the gates were initialized to zero. Peephole connections were initialized to zero. The supervised classifiers trained on 16 frames took 5-15 minutes to converge.

3.2. Visualization and Qualitative Analysis

The aim of this set of experiments is to visualize the properties of the proposed models.

Experiments on MNIST
We first trained our models on a dataset of moving MNIST digits. In this dataset, each video was 20 frames long and consisted of two digits moving inside a 64 × 64 patch. The digits were chosen randomly from the training set and placed initially at random locations inside the patch. Each digit was assigned a velocity whose direction was chosen uniformly at random on a unit circle and whose magnitude was also chosen uniformly at random over a fixed range. The digits bounced off the edges of the 64 × 64 frame and overlapped if they were at the same location. The reason for working with this dataset is that it is infinite in size and can be generated quickly on the fly. This makes it possible to explore the model without expensive disk accesses or overfitting issues. It also has interesting behaviours due to occlusions and the dynamics of bouncing off the walls.
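Such data is straightforward to generate on the fly. The sketch below is not the authors' generator: the velocity range, the pixel-wise maximum used where digits overlap, and the assumption that MNIST digits are available as 28 × 28 arrays in [0, 1] are illustrative choices.

```python
# Rough sketch of a moving-MNIST sequence generator matching the description
# above (two digits, random start, random direction and speed, bouncing walls).
import numpy as np

def make_moving_mnist(digits, seq_len=20, size=64, num_digits=2, rng=np.random):
    """digits: array of shape (N, 28, 28) with values in [0, 1] (assumed)."""
    video = np.zeros((seq_len, size, size), dtype=np.float32)
    for _ in range(num_digits):
        digit = digits[rng.randint(len(digits))]
        h, w = digit.shape
        x = float(rng.randint(0, size - w))          # random initial location
        y = float(rng.randint(0, size - h))
        theta = rng.uniform(0.0, 2.0 * np.pi)        # direction uniform on a circle
        speed = rng.uniform(2.0, 5.0)                # magnitude from a fixed range
        vx, vy = speed * np.cos(theta), speed * np.sin(theta)
        for t in range(seq_len):
            ys, xs = int(round(y)), int(round(x))
            # digits simply overlap (pixel-wise max) at the same location
            np.maximum(video[t, ys:ys + h, xs:xs + w], digit,
                       out=video[t, ys:ys + h, xs:xs + w])
            if not 0.0 <= x + vx <= size - w:        # bounce off the frame edges
                vx = -vx
            if not 0.0 <= y + vy <= size - h:
                vy = -vy
            x, y = x + vx, y + vy
    return video
```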
Figure 5. Reconstruction and future prediction obtained from the Composite Model on a dataset of moving MNIST digits.
We first trained a single layer Composite Model. Each LSTM had 2048 units. The encoder took 10 frames as input. The decoder tried to reconstruct these 10 frames and the future predictor attempted to predict the next 10 frames. We used logistic output units with a cross entropy loss function. Fig. 5 shows two examples of running this model. The true sequences are shown in the first two rows. The next two rows show the reconstruction and future prediction from the one layer Composite Model. It is interesting to note that the model figures out how to separate superimposed digits and can model them even as they pass through each other. This shows some evidence of disentangling the two independent factors of variation in this sequence. The model can also correctly predict the motion after bouncing off the walls. In order to see if adding depth helps, we trained a two layer Composite Model, with each layer having 2048 units. We can see that adding depth helps the model make better predictions. Next, we changed the future predictor by making it conditional. We can see that this model makes sharper predictions.

Experiments on Natural Image Patches
Next, we tried to see if our models can also work with natural image patches. For this, we trained the models on sequences of 32 × 32 natural image patches extracted from the UCF-101 dataset. In this case, we used linear output units and the squared error loss function. The input was 16 frames and the model was asked to reconstruct the 16 frames and predict the future 13 frames. Fig. 6 shows the results obtained from a two layer Composite model with 2048 units. We found that the reconstructions and the predictions are both very blurry. We then trained a bigger model with 4096 units. The outputs from this model are also shown in Fig. 6. We can see that the reconstructions get much sharper.
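The two readout/loss configurations used in these experiments (logistic outputs with cross entropy for the binary MNIST pixels, linear outputs with squared error for natural patches) amount to the following, shown here as a hypothetical PyTorch-style helper rather than the paper's implementation:

```python
# Sketch of the two output/loss choices described above; the function name and
# framework are illustrative assumptions.
import torch.nn.functional as F

def frame_loss(pred, target, binary_pixels):
    if binary_pixels:
        # Moving MNIST: logistic output units + cross entropy loss.
        return F.binary_cross_entropy_with_logits(pred, target)
    # Natural image patches: linear output units + squared error loss.
    return F.mse_loss(pred, target)
```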
Figure 6. Reconstruction and future prediction obtained from the Composite Model on a dataset of natural image patches. The first two rows show ground truth sequences. The model takes 16 frames as input. Only the last 10 frames of the input sequence are shown here. The next 13 frames are the ground truth future. In the rows that follow, we show the reconstructed and predicted frames for two instances of the model.
been at http://www.cs.toronto.edu/˜nitish/ look like higher frequency strips. It is conceivable that the
unsupervised_video. To show that setting up a pe- high frequency features help in encoding the direction and
riodic behaviour is not trivial, Fig. 7(b) shows the activ- velocity of motion.
ity from a randomly initialized future predictor. Here, the
Fig. 10 shows the output features from the two LSTM de-
LSTM state quickly converges and the outputs blur com-
coders of a Composite Model. These correspond to the
pletely.
weights connecting the LSTM output units to the output
Out-of-domain Inputs layer. They appear to be somewhat qualitatively different
Next, we test this model’s ability to deal with out-of- from the input features shown in Fig. 9. There are many
domain inputs. For this, we test the model on sequences more output features that are local blobs, whereas those are
of one and three moving digits. The model was trained on rare in the input features. In the output features, the ones
sequences of two moving digits, so it has never seen in- that do look like strips are much shorter than those in the
puts with just one digit or three digits. Fig. 8 shows the input features. One way to interpret this is the following.
reconstruction and future prediction results. For one mov- The model needs to know about motion (which direction
ing digit, we can see that the model can do a good job but and how fast things are moving) from the input. This re-
it really tries to hallucinate a second digit overlapping with quires precise information about location (thin strips) and
the first one. The second digit shows up towards the end velocity (high frequency strips). But when it is generating
of the future reconstruction. For three digits, the model the output, the model wants to hedge its bets so that it does
merges digits into blobs. However, it does well at getting not suffer a huge loss for predicting things sharply at the
the overall motion right. This highlights a key drawback of wrong place. This could explain why the output features
modeling entire frames of input in a single pass. In order to have somewhat bigger blobs. The relative shortness of the
model videos with variable number of objects, we perhaps strips in the output features can be explained by the fact that
need models that not only have an attention mechanism in in the inputs, it does not hurt to have a longer feature than
place, but can also learn to execute themselves a variable what is needed to detect a location because information is
number of times and do variable amounts of computation. coarse-coded through multiple features. But in the output,
the model may not want to put down a feature that is bigger
Visualizing Features
than any digit because other units will have to conspire to
Next, we visualize the features learned by this model.
correct for it.
Fig. 9 shows the weights that connect each input frame to
the encoder LSTM. There are four sets of weights. One It is much harder to visualize the recurrent weights going
set of weights connects the frame to the input units. There from the outputs of the LSTM units into the gates at the
are three other sets, one corresponding to each of the three next time step. We are currently working on good ways to
gates (input, forget and output). Each weight has a size of get those visualizations.
64 × 64. A lot of features look like thin strips. Others
[Figure 7: activity panels labelled Input Gates, Forget Gates, Input, Output Gates, Cell States and Output.]
Figure 8. Out-of-domain runs. Reconstruction and future prediction for test sequences of one and three moving digits. The model was trained on sequences of two moving digits.
Figure 9. Input features from a Composite Model trained on moving MNIST digits. In an LSTM, each input frame is connected to four sets of units: the input, the input gate, forget gate and output gate. These figures show the top-200 features ordered by L2 norm of the input features. The features in corresponding locations belong to the same LSTM unit.
Figure 10. Output features from the two decoder LSTMs of a Composite Model trained on moving MNIST digits. These figures show
the top-200 features ordered by L2 norm.
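As a rough sketch of how an ordering like the one in Figure 9 could be computed, the snippet below ranks the input-to-hidden weights of a hypothetical PyTorch nn.LSTM encoder by the L2 norm of each unit's cell-input features; the framework, layer layout, and the choice of block to rank by are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def top_input_features(lstm: nn.LSTM, k=200, frame_shape=(64, 64)):
    # weight_ih_l0 stacks the first layer's input-to-hidden weights for the
    # four blocks (input gate, forget gate, cell input, output gate), each of
    # shape (hidden_size, input_size); input_size is assumed to be 64*64.
    i_w, f_w, g_w, o_w = lstm.weight_ih_l0.detach().chunk(4, dim=0)
    # Rank units by the L2 norm of their cell-input ("input") weight vectors.
    order = g_w.norm(dim=1).argsort(descending=True)[:k]
    # Return, for each selected unit, its four weight vectors reshaped to images.
    return [w[order].reshape(-1, *frame_shape) for w in (g_w, i_w, f_w, o_w)]
```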
[Plots of Classification Accuracy.]
Table 3. Comparison of different unsupervised pretraining methods. UCF-101 small is a subset containing 10 videos per class. HMDB-51 small contains 4 videos per class.
[...] explicitly computed flow features only. Models in the third set use both.

On RGB data, our model performs on par with the best deep models. It performs 3% better than the LRCN model that also used LSTMs on top of convnet features¹. Our model performs better than C3D features that use a 3D convolutional net. However, when the C3D features are concatenated with fc6 percepts, they do slightly better than our model.

The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is at least partly due to the fact that the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover.

When we combine predictions from the RGB and flow models, we obtain 84.3% accuracy on UCF-101. We believe further improvements can be made by running the model over different patch locations and mirroring the patches. Also, our model can be applied deeper inside the convnet instead of just at the top level. That can potentially lead to further improvements. In this paper, we focus on showing that unsupervised training helps consistently across both datasets and across different sized training sets.

¹ However, the improvement is only partially from unsupervised learning, since we used a better convnet model.

Table 4. Comparison with state-of-the-art action recognition models.

Method                                                       UCF-101   HMDB-51
Spatial Convolutional Net (Simonyan & Zisserman, 2014a)         73.0      40.5
C3D (Tran et al., 2014)                                         72.3        -
C3D + fc6 (Tran et al., 2014)                                   76.4        -
LRCN (Donahue et al., 2014)                                     71.1        -
Composite LSTM Model                                            75.8      44.0

Temporal Convolutional Net (Simonyan & Zisserman, 2014a)        83.7      54.6
LRCN (Donahue et al., 2014)                                     77.0        -
Composite LSTM Model                                            77.7        -

LRCN (Donahue et al., 2014)                                     82.9        -
Two-stream Convolutional Net (Simonyan & Zisserman, 2014a)      88.0      59.4
Multi-skip feature stacking (Lan et al., 2014)                  89.1      65.1
Composite LSTM Model                                            84.3        -
4. Conclusions

We proposed models based on LSTMs that can learn good video representations. We compared them and analyzed their properties through visualizations. Moreover, we managed to get an improvement on supervised tasks. The best performing model was the Composite Model that combined an autoencoder and a future predictor. Conditioning on generated outputs did not have a significant impact on the performance for supervised tasks; however, it made the future predictions look slightly better. The model was able to persistently generate motion well beyond the time scales it was trained for. However, it lost the precise object features rapidly after the training time scale. The features at the input and output layers were found to have some interesting properties.

To get further improvements for supervised tasks, we believe that the model can be extended by applying it convolutionally across patches of the video and stacking multiple layers of such models. Applying this model in the lower layers of a convolutional net could help extract motion information that would otherwise be lost across max-pooling layers. In our future work, we plan to build models based on these autoencoders from the bottom up instead of applying them only to percepts.

References

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.

Lan, Z.-Z., Lin, M., Li, X., Hauptmann, A. G., and Raj, B. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, abs/1411.6660, 2014.

Le, Q. V., Zou, W., Yeung, S. Y., and Ng, A. Y. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.

Memisevic, R. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829–1846, 2013.

Memisevic, R. and Hinton, G. E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, June 2010.