0% found this document useful (0 votes)

17 views12 pages

paper5

This paper presents a method for unsupervised learning of video representations using Long Short Term Memory (LSTM) networks. The authors propose an encoder-decoder framework where the encoder LSTM maps video sequences into fixed-length representations, which are then decoded to reconstruct the input or predict future frames. The model is evaluated on human action recognition tasks, demonstrating that learned representations improve classification accuracy, particularly with limited training data.

Uploaded by

c2711055

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views12 pages

paper5

Uploaded by

c2711055

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

Unsupervised Learning of Video Representations using LSTMs

Nitish Srivastava NITISH @ CS . TORONTO . EDU

Elman Mansimov EMANSIM @ CS . TORONTO . EDU
Ruslan Salakhutdinov RSALAKHU @ CS . TORONTO . EDU
University of Toronto, 6 Kings College Road, Toronto, ON M5S 3G4 CANADA
arXiv:1502.04681v2 [cs.LG] 31 Mar 2015

Abstract et al., 2014), and caption generation for images (Vinyals

We use multilayer Long Short Term Memory et al., 2014). They have also been applied on videos for
(LSTM) networks to learn representations of recognizing actions and generating natural language de-
video sequences. Our model uses an encoder scriptions (Donahue et al., 2014). A general sequence to
LSTM to map an input sequence into a fixed sequence learning framework was described by Sutskever
length representation. This representation is de- et al. (2014) in which a recurrent network is used to encode
coded using single or multiple decoder LSTMs a sequence into a fixed length representation, and then an-
to perform different tasks, such as reconstruct- other recurrent network is used to decode a sequence out of
ing the input sequence, or predicting the future that representation. In this work, we apply and extend this
sequence. We experiment with two kinds of framework to learn representations of sequences of images.
input sequences – patches of image pixels and We choose to work in the unsupervised setting where we
high-level representations (“percepts”) of video only have access to a dataset of unlabelled videos.
frames extracted using a pretrained convolutional Videos are an abundant and rich source of visual infor-
net. We explore different design choices such mation and can be seen as a window into the physics of
as whether the decoder LSTMs should condi- the world we live in, showing us examples of what con-
tion on the generated output. We analyze the stitutes objects, how objects move against backgrounds,
outputs of the model qualitatively to see how what happens when cameras move and how things get oc-
well the model can extrapolate the learned video cluded. Being able to learn a representation that disen-
representation into the future and into the past. tangles these factors would help in making intelligent ma-
We try to visualize and interpret the learned fea- chines that can understand and act in their environment.
tures. We stress test the model by running it on Additionally, learning good video representations is essen-
longer time scales and on out-of-domain data. tial for a number of useful tasks, such as recognizing ac-
We further evaluate the representations by fine- tions and gestures.
tuning them for a supervised learning problem –
human action recognition on the UCF-101 and 1.1. Why Unsupervised Learning?
HMDB-51 datasets. We show that the represen-
tations help improve classification accuracy, es- Supervised learning has been extremely successful in learn-
pecially when there are only a few training ex- ing good visual representations that not only produce good
amples. Even models pretrained on unrelated results at the task they are trained for, but also transfer well
datasets (300 hours of YouTube videos) can help to other tasks and datasets. Therefore, it is natural to ex-
action recognition performance. tend the same approach to learning video representations.
This has led to research in 3D convolutional nets (Ji et al.,
2013; Tran et al., 2014), different temporal fusion strategies
1. Introduction (Karpathy et al., 2014) and exploring different ways of pre-
senting visual information to convolutional nets (Simonyan
Understanding temporal sequences is important for solv-
& Zisserman, 2014a). However, videos are much higher di-
ing many problems in the AI-set. Recently, recurrent neu-
mensional entities compared to single images. Therefore, it
ral networks using the Long Short Term Memory (LSTM)
becomes increasingly difficult to do credit assignment and
architecture (Hochreiter & Schmidhuber, 1997) have been
learn long range structure, unless we collect much more
used successfully to perform various supervised sequence
labelled data or do a lot of feature engineering (for exam-
learning tasks, such as speech recognition (Graves & Jaitly,
ple computing the right kinds of flow features) to keep the
2014), machine translation (Sutskever et al., 2014; Cho
Unsupervised Learning with LSTMs

dimensionality low. The costly work of collecting more tion recognition. If the unsupervised learning model comes
labelled data and the tedious work of doing more clever en- up with useful representations then the classifier should be
gineering can go a long way in solving particular problems, able to perform better, especially when there are only a few
but this is ultimately unsatisfying as a machine learning labelled examples. We find that this is indeed the case.
solution. This highlights the need for using unsupervised
learning to find and represent structure in videos. More- 1.3. Related Work
over, videos have a lot of structure in them (spatial and
temporal regularities) which makes them particularly well The first approaches to learning representations of videos
suited as a domain for building unsupervised learning mod- in an unsupervised way were based on ICA (van Hateren
els. & Ruderman, 1998; Hurri & Hyvärinen, 2003). Le et al.
(2011) approached this problem using multiple layers of
Independent Subspace Analysis modules. Generative mod-
1.2. Our Approach
els for understanding transformations between pairs of con-
When designing any unsupervised learning model, it is cru- secutive images are also well studied (Memisevic, 2013;
cial to have the right inductive biases and choose the right Memisevic & Hinton, 2010; Susskind et al., 2011). This
objective function so that the learning signal points the work was extended recently by Michalski et al. (2014) to
model towards learning useful features. In this paper, we model longer sequences.
use the LSTM Encoder-Decoder framework to learn video
Recently, Ranzato et al. (2014) proposed a generative
representations. The key inductive bias here is that the
model for videos. The model uses a recurrent neural
same operation must be applied at each time step to prop-
network to predict the next frame or interpolate between
agate information to the next step. This enforces the fact
frames. In this work, the authors highlight the importance
that the physics of the world remains the same, irrespec-
of choosing the right loss function. It is argued that squared
tive of input. The same physics acting on any state, at any
loss in input space is not the right objective because it does
time, must produce the next state. Our model works as
not respond well to small distortions in input space. The
follows. The Encoder LSTM runs through a sequence of
proposed solution is to quantize image patches into a large
frames to come up with a representation. This representa-
dictionary and train the model to predict the identity of
tion is then decoded through another LSTM to produce a
the target patch. This does solve some of the problems of
target sequence. We consider different choices of the tar-
squared loss but it introduces an arbitrary dictionary size
get sequence. One choice is to predict the same sequence
into the picture and altogether removes the idea of patches
as the input. The motivation is similar to that of autoen-
being similar or dissimilar to one other. Designing an ap-
coders – we wish to capture all that is needed to reproduce
propriate loss function that respects our notion of visual
the input but at the same time go through the inductive bi-
similarity is a very hard problem (in a sense, almost as hard
ases imposed by the model. Another option is to predict the
as the modeling problem we want to solve in the first place).
future frames. Here the motivation is to learn a representa-
Therefore, in this paper, we use the simple squared loss ob-
tion that extracts all that is needed to extrapolate the motion
jective function as a starting point and focus on designing
and appearance beyond what has been observed. These two
an encoder-decoder RNN architecture that can be used with
natural choices can also be combined. In this case, there are
any loss function.
two decoder LSTMs – one that decodes the representation
into the input sequence and another that decodes the same
representation to predict the future. 2. Model Description
The inputs to the model can, in principle, be any represen- In this section, we describe several variants of our LSTM
tation of individual video frames. However, for the pur- Encoder-Decoder model. The basic unit of our network
poses of this work, we limit our attention to two kinds of is the LSTM cell block. Our implementation of LSTMs
inputs. The first is image patches. For this we use natural follows closely the one discussed by Graves (2013).
image patches as well as a dataset of moving MNIST digits.
The second is high-level “percepts” extracted by applying a 2.1. Long Short Term Memory
convolutional net trained on ImageNet. These percepts are
the states of last (and/or second-to-last) layers of rectified In this section we briefly describe the LSTM unit which is
linear hidden states from a convolutional neural net model. the basic building block of our model. The unit is shown in
Fig. 1 (reproduced from Graves (2013)).
In order to evaluate the learned representations we quali-
tatively analyze the reconstructions and predictions made Each LSTM unit has a cell which has a state ct at time t.
by the model. For a more quantitative evaluation, we use This cell can be thought of as a memory unit. Access to
these LSTMs as initializations for the supervised task of ac- this memory unit for reading or modifying it is controlled
through sigmoidal gates – input gate it , forget gate ft and
Unsupervised Learning with LSTMs

Learned vˆ3 vˆ2 vˆ1

Representation

W1 W1 copy W2 W2

v1 v2 v3 v3 v2

Figure 2. LSTM Autoencoder Model

vised learning. The model consists of two RNNs – the en-

coder LSTM and the decoder LSTM as shown in Fig. 2.
Figure 2: Figure
Long Short-term Memory Cell
1. LSTM unit The input to the model is a sequence of vectors (image
patches or features). The encoder LSTM reads in this se-
output gate ot . [16],
(LSTM) architecture ThewhichLSTM unit operates
uses purpose-built memoryas cells
follows. At
to store infor- quence. After the last input has been read, the decoder
each
mation,time stepat itfinding
is better receives inputs long
and exploiting fromrange
twodependencies
external sources
in the data.
Fig. 2 illustrates a single LSTM memory cell. For the version of LSTM used in LSTM takes over and outputs a prediction for the target se-
at each of the four terminals (the three gates and the input).
this paper [7] H is implemented by the following composite function: quence. The target sequence is same as the input sequence,
The first source is the current frame xt . The second source
i = σ (Wxi xt + Whi ht−1 + Wci ct−1 + bi ) (7) but in reverse order. Reversing the target sequence makes
is the previoust hidden states of all LSTM units in the same
ft = σ (Wxf xt + Whf ht−1 + Wcf ct−1 + bf ) (8) the optimization easier because the model can get off the
layer ht−1 . Additionally, each gate has an internal source,(9)
ct = ft ct−1 + it tanh (Wxc xt + Whc ht−1 + bc ) ground by looking at low range correlations. This is also
the cell state oct t−1 = σ (Wxo xt + Who hblock.
of its cell The links between (10)
t−1 + Wco ct + bo )
a
inspired by how lists are represented in LISP. The encoder
cell and its own ht =gates aret ) called peephole connections. The
ot tanh(c (11)
can be seen as creating a list by applying the cons func-
inputs
where σ is coming from
the logistic different
sigmoid function,sources
and i, f , get
o andadded up, alongthe
c are respectively tion on the previously constructed list and the new input.
with a bias.
input gate, The output
forget gate, gatesgate, arecellactivated by activation
and cell input passing vectors,
their to-all of
which are the same size as the hidden vector h. The weight matrix subscripts The decoder essentially unrolls this list, with the hidden to
tal input through the logistic function. The total input at
have the obvious meaning, for example Whi is the hidden-input gate matrix, output weights extracting the element at the top of the list
the
Wxo input terminal isgate
is the input-output passed
matrixthrough
etc. Thethe tanh
weight non-linearity.
matrices from the cell
to gateresulting
vectors (e.g. Wci ) are diagonal, so elementby m in each gate vector ofonly
(car function) and the hidden to hidden weights extract-
The activation is multiplied the activation
receives input from element m of the cell vector. The bias terms (which are ing the rest of the list (cdr function). Therefore, the first
the input gate. This is then added
added to i, f , c and o) have been omitted for clarity. to the cell state after mul-
element out is the last element in.
The original
tiplying LSTM
the cell algorithm
state by theused forgeta custom
gate’sdesigned ft . The
approximate
activation gradi-
ent calculation that allowed the weights to be updated after every timestep [16].
final output from the LSTM unit
However the full gradient can instead be calculated h is computed by multi-
t with backpropagation through The decoder can be of two kinds – conditional or uncondi-
plying
time [11],thethe output
method used gate’s activation
in this paper. One otdifficulty
with the updated
when training cell
LSTM tioned. A conditional decoder receives the last generated
with the full gradient is that the derivatives sometimes become excessively large,
state passed through a tanh non-linearity. These updates output frame as input, i.e., the dotted input in Fig. 2 is
are summarized for a layer of 5LSTM units as follows present. An unconditioned decoder does not receive that
it = σ (Wxi xt + Whi ht−1 + Wci ct−1 + bi ) , input. This is discussed in more detail in Sec. 2.4. Fig. 2
shows a single layer LSTM Autoencoder. The architecture
ft = σ (Wxf xt + Whf ht−1 + Wcf ct−1 + bf ) ,
can be extend to multiple layers by stacking LSTMs on top
ct = ft ct−1 + it tanh (Wxc xt + Whc ht−1 + bc ) , of each other.
ot = σ (Wxo xt + Who ht−1 + Wco ct + bo ) ,
Why should this learn good features?
ht = ot tanh(ct ). The state of the encoder LSTM after the last input has been
Note that all Wc• matrices are diagonal, whereas the rest read is the representation of the input video. The decoder
are dense. The key advantage of using an LSTM unit over LSTM is being asked to reconstruct back the input se-
a traditional neuron in an RNN is that the cell state in an quence from this representation. In order to do so, the rep-
LSTM unit sums activities over time. Since derivatives dis- resentation must retain information about the appearance
tribute over sums, the error derivatives don’t vanish quickly of the objects and the background as well as the motion
as they get sent back into time. This makes it easy to do contained in the video. However, an important question for
credit assignment over long sequences and discover long- any autoencoder-style model is what prevents it from learn-
range features. ing an identity mapping and effectively copying the input
to the output. In that case all the information about the in-
put would still be present but the representation will be no
2.2. LSTM Autoencoder Model
better than the input. There are two factors that control this
In this section, we describe a model that uses Recurrent behaviour. First, the fact that there are only a fixed num-
Neural Nets (RNNs) made of LSTM units to do unsuper- ber of hidden units makes it unlikely that the model can
Unsupervised Learning with LSTMs

Learned vˆ4 vˆ5 vˆ6 Input Reconstruction vˆ3 vˆ2 vˆ1

Representation

W1 W1 copy W2 W2 W2 W2

Learned
copy
Representation
v1 v2 v3 v4 v5 v3 v2

W1 W1

Figure 3. LSTM Future Predictor Model vˆ4 vˆ5 vˆ6

copy

v1 v2 v3
learn trivial mappings for arbitrary length input sequences.
W3 W3
Second, the same LSTM operation is used to decode the
Sequence of Input Frames
representation recursively. This means that the same dy-
namics must be applied on the representation at any stage
of decoding. This further prevents the model from learning Future Prediction v4 v5
an identity mapping.

2.3. LSTM Future Predictor Model Figure 4. The Composite Model: The LSTM predicts the future
as well as the input sequence.
Another natural unsupervised learning task for sequences
is predicting the future. This is the approach used in lan-
target and hence a unimodal target distribution. But for the
guage models for modeling sequences of words. The de-
LSTM Future Predictor there is a possibility of multiple
sign of the Future Predictor Model is same as that of the
targets given an input because even if we assume a deter-
Autoencoder Model, except that the decoder LSTM in this
ministic universe, everything needed to predict the future
case predicts frames of the video that come after the in-
will not necessarily be observed in the input.
put sequence (Fig. 3). Ranzato et al. (2014) use a similar
model but predict only the next frame at each time step. There is also an argument against using a conditional
This model, on the other hand, predicts a long sequence decoder from the optimization point-of-view. There are
into the future. Here again we can consider two variants of strong short-range correlations in video data, for example,
the decoder – conditional and unconditioned. most of the content of a frame is same as the previous one.
If the decoder was given access to the last few frames while
Why should this learn good features?
generating a particular frame at training time, it would find
In order to predict the next few frames correctly, the model
it easy to pick up on these correlations. There would only
needs information about which objects and background are
be a very small gradient that tries to fix up the extremely
present and how they are moving so that the motion can
subtle errors that require long term knowledge about the
be extrapolated. The hidden state coming out from the en-
input sequence. In an unconditioned decoder, this input is
coder will try to capture this information. Therefore, this
removed and the model is forced to look for information
state can be seen as a representation of the input sequence.
deep inside the encoder.
2.4. Conditional Decoder
2.5. A Composite Model
For each of these two models, we can consider two possi-
The two tasks – reconstructing the input and predicting the
bilities - one in which the decoder LSTM is conditioned on
future can be combined to create a composite model as
the last generated frame and the other in which it is not. In
shown in Fig. 4. Here the encoder LSTM is asked to come
the experimental section, we explore these choices quanti-
up with a state from which we can both predict the next few
tatively. Here we briefly discuss arguments for and against
frames as well as reconstruct the input.
a conditional decoder. A strong argument in favour of using
a conditional decoder is that it allows the decoder to model This composite model tries to overcome the shortcomings
multiple modes in the target sequence distribution. With- that each model suffers on its own. A high-capacity au-
out that, we would end up averaging the multiple modes in toencoder would suffer from the tendency to learn trivial
the low-level input space. However, this is an issue only if representations that just memorize the inputs. However,
we expect multiple modes in the target sequence distribu- this memorization is not useful at all for predicting the fu-
tion. For the LSTM Autoencoder, there is only one correct ture. Therefore, the composite model cannot just memo-
Unsupervised Learning with LSTMs

rize information. On the other hand, the future predictor learning, and because we did not want to introduce any un-
suffers form the tendency to store information only about natural bias in the samples. We also used the supervised
the last few frames since those are most important for pre- datasets (UCF-101 and HMDB-51) for unsupervised train-
dicting the future, i.e., in order to predict vt , the frames ing. However, we found that using them did not give any
{vt−1 , . . . , vt−k } are much more important than v0 , for significant advantage over just using the YouTube videos.
some small value of k. Therefore the representation at the
We extracted percepts using the convolutional neural net
end of the encoder will have forgotten about a large part of
model of Simonyan & Zisserman (2014b). The videos
the input. But if we ask the model to also predict all of the
have a resolution of 240 × 320 and were sampled at al-
input sequence, then it cannot just pay attention to the last
most 30 frames per second. We took the central 224 × 224
few frames.
patch from each frame and ran it through the convnet. This
gave us the RGB percepts. Additionally, for UCF-101, we
3. Experiments computed flow percepts by extracting flows using the Brox
method and training the temporal stream convolutional net-
We design experiments to accomplish the following objec-
work as described by Simonyan & Zisserman (2014a). We
tives:
found that the fc6 features worked better than fc7 for sin-
• Get a qualitative understanding of what the LSTM gle frame classification using both RGB and flow percepts.
learns to do. Therefore, we used the 4096-dimensional fc6 layer as the
input representation of our data. Besides these percepts,
• Measure the benefit of initializing networks for super- we also trained the proposed models on 32 × 32 patches of
vised learning tasks with the weights found by unsu- pixels.
pervised learning, especially with very few training
All models were trained using backprop on a single
examples.
NVIDIA Titan GPU. A two layer 2048 unit Composite
• Compare the different proposed models - Autoen- model that predicts 13 frames and reconstructs 16 frames
coder, Future Predictor and Composite models and took 18-20 hours to converge on 300 hours of percepts. We
their conditional variants. initialized weights by sampling from a uniform distribution
whose scale was set to 1/sqrt(fan-in). Biases at all the gates
• Compare with state-of-the-art action recognition were initialized to zero. Peep-hole connections were ini-
benchmarks. tialized to zero. The supervised classifiers trained on 16
frames took 5-15 minutes to converge.
3.1. Datasets
3.2. Visualization and Qualitative Analysis
We use the UCF-101 and HMDB-51 datasets for super-
vised tasks. The UCF-101 dataset (Soomro et al., 2012) The aim of this set of experiments to visualize the proper-
contains 13,320 videos with an average length of 6.2 sec- ties of the proposed models.
onds belonging to 101 different action categories. The Experiments on MNIST
dataset has 3 standard train/test splits with the training set We first trained our models on a dataset of moving MNIST
containing around 9,500 videos in each split (the rest are digits. In this dataset, each video was 20 frames long and
test). The HMDB-51 dataset (Kuehne et al., 2011) contains consisted of two digits moving inside a 64 × 64 patch.
5100 videos belonging to 51 different action categories. The digits were chosen randomly from the training set and
Mean length of the videos is 3.2 seconds. This also has placed initially at random locations inside the patch. Each
3 train/test splits with 3570 videos in the training set and digit was assigned a velocity whose direction was chosen
rest in test. uniformly randomly on a unit circle and whose magnitude
To train the unsupervised models, we used a subset of the was also chosen uniformly at random over a fixed range.
Sports-1M dataset (Karpathy et al., 2014), that contains The digits bounced-off the edges of the 64 × 64 frame and
1 million YouTube clips. Even though this dataset is la- overlapped if they were at the same location. The reason
belled for actions, we did not do any supervised experi- for working with this dataset is that it is infinite in size and
ments on it because of logistical constraints with working can be generated quickly on the fly. This makes it possi-
with such a huge dataset. We instead collected 300 hours ble to explore the model without expensive disk accesses
of video by randomly sampling 10 second clips from the or overfitting issues. It also has interesting behaviours due
dataset. It is possible to collect better samples if instead of to occlusions and the dynamics of bouncing off the walls.
choosing randomly, we extracted videos where a lot of mo- We first trained a single layer Composite Model. Each
tion is happening and where there are no shot boundaries. LSTM had 2048 units. The encoder took 10 frames as in-
However, we did not do so in the spirit of unsupervised
Unsupervised Learning with LSTMs
Input Sequence - Ground Truth Future -

Input Reconstruction - Future Prediction -

One Layer Composite Model

Two Layer Composite Model

Two Layer Composite Model with a Conditional Future Predictor

Figure 5. Reconstruction and future prediction obtained from the Composite Model on a dataset of moving MNIST digits.
put. The decoder tried to reconstruct these 10 frames and 2048 units. We found that the reconstructions and the pre-
the future predictor attempted to predict the next 10 frames. dictions are both very blurry. We then trained a bigger
We used logistic output units with a cross entropy loss func- model with 4096 units. The outputs from this model are
tion. Fig. 5 shows two examples of running this model. also shown in Fig. 6. We can see that the reconstructions
The true sequences are shown in the first two rows. The get much sharper.
next two rows show the reconstruction and future predic-
Generalization over time scales
tion from the one layer Composite Model. It is interesting
In the next experiment, we test if the model can work
to note that the model figures out how to separate superim-
at time scales that are different than what it was trained
posed digits and can model them even as they pass through
on. We take a one hidden layer unconditioned Compos-
each other. This shows some evidence of disentangling the
ite Model trained on moving MNIST digits. The model
two independent factors of variation in this sequence. The
has 2048 LSTM units and looks at a 64 × 64 input. It
model can also correctly predict the motion after bounc-
was trained on input sequences of 10 frames to reconstruct
ing off the walls. In order to see if adding depth helps,
those 10 frames as well as predict 10 frames into the fu-
we trained a two layer Composite Model, with each layer
ture. In order to test if the future predictor is able to gen-
having 2048 units. We can see that adding depth helps the
eralize beyond 10 frames, we let the model run for 100
model make better predictions. Next, we changed the fu-
steps into the future. Fig. 7(a) shows the pattern of ac-
ture predictor by making it conditional. We can see that
tivity in the LSTM units of the future predictor pathway
this model makes sharper predictions.
for a randomly chosen test input. It shows the activity
Experiments on Natural Image Patches at each of the three sigmoidal gates (input, forget, out-
put), the input (after the tanh non-linearity, before being
multiplied by the input gate), the cell state and the final
Next, we tried to see if our models can also work with nat-
output (after being multiplied by the output gate). Even
ural image patches. For this, we trained the models on se-
though the units are ordered randomly along the vertical
quences of 32 × 32 natural image patches extracted from
axis, we can see that the dynamics has a periodic quality
the UCF-101 dataset. In this case, we used linear output
to it. The model is able to generate persistent motion for
units and the squared error loss function. The input was
long periods of time. In terms of reconstruction, the model
16 frames and the model was asked to reconstruct the 16
only outputs blobs after the first 15 frames, but the motion
frames and predict the future 13 frames. Fig. 6 shows the
is relatively well preserved. More results, including long
results obtained from a two layer Composite model with
range future predictions over hundreds of time steps can see
Unsupervised Learning with LSTMs
Input Sequence - Ground Truth Future -

Input Reconstruction - Future Prediction -

Two Layer Composite Model with 2048 LSTM units

Two Layer Composite Model with 4096 LSTM units

Figure 6. Reconstruction and future prediction obtained from the Composite Model on a dataset of natural image patches. The first two
rows show ground truth sequences. The model takes 16 frames as inputs. Only the last 10 frames of the input sequence are shown here.
The next 13 frames are the ground truth future. In the rows that follow, we show the reconstructed and predicted frames for two instances
of the model.
been at http://www.cs.toronto.edu/˜nitish/ look like higher frequency strips. It is conceivable that the
unsupervised_video. To show that setting up a pe- high frequency features help in encoding the direction and
riodic behaviour is not trivial, Fig. 7(b) shows the activ- velocity of motion.
ity from a randomly initialized future predictor. Here, the
Fig. 10 shows the output features from the two LSTM de-
LSTM state quickly converges and the outputs blur com-
coders of a Composite Model. These correspond to the
pletely.
weights connecting the LSTM output units to the output
Out-of-domain Inputs layer. They appear to be somewhat qualitatively different
Next, we test this model’s ability to deal with out-of- from the input features shown in Fig. 9. There are many
domain inputs. For this, we test the model on sequences more output features that are local blobs, whereas those are
of one and three moving digits. The model was trained on rare in the input features. In the output features, the ones
sequences of two moving digits, so it has never seen in- that do look like strips are much shorter than those in the
puts with just one digit or three digits. Fig. 8 shows the input features. One way to interpret this is the following.
reconstruction and future prediction results. For one mov- The model needs to know about motion (which direction
ing digit, we can see that the model can do a good job but and how fast things are moving) from the input. This re-
it really tries to hallucinate a second digit overlapping with quires precise information about location (thin strips) and
the first one. The second digit shows up towards the end velocity (high frequency strips). But when it is generating
of the future reconstruction. For three digits, the model the output, the model wants to hedge its bets so that it does
merges digits into blobs. However, it does well at getting not suffer a huge loss for predicting things sharply at the
the overall motion right. This highlights a key drawback of wrong place. This could explain why the output features
modeling entire frames of input in a single pass. In order to have somewhat bigger blobs. The relative shortness of the
model videos with variable number of objects, we perhaps strips in the output features can be explained by the fact that
need models that not only have an attention mechanism in in the inputs, it does not hurt to have a longer feature than
place, but can also learn to execute themselves a variable what is needed to detect a location because information is
number of times and do variable amounts of computation. coarse-coded through multiple features. But in the output,
the model may not want to put down a feature that is bigger
Visualizing Features
than any digit because other units will have to conspire to
Next, we visualize the features learned by this model.
correct for it.
Fig. 9 shows the weights that connect each input frame to
the encoder LSTM. There are four sets of weights. One It is much harder to visualize the recurrent weights going
set of weights connects the frame to the input units. There from the outputs of the LSTM units into the gates at the
are three other sets, one corresponding to each of the three next time step. We are currently working on good ways to
gates (input, forget and output). Each weight has a size of get those visualizations.
64 × 64. A lot of features look like thin strips. Others
Unsupervised Learning with LSTMs

(a) Trained Future Predictor

0 Input Gates 0 Forget Gates 0 Input 0 Output Gates 0 Cell States 0 Output

50 50 50 50 50 50

100 100 100 100 100 100

150 150 150 150 150 150

0 20 40 60 80 0 20 40 60 80 0 20 40 60 80 0 20 40 60 80 0 20 40 60 80 0 20 40 60 80

(b) Randomly Initialized Future Predictor

Figure 7. Pattern of activity in 200 randomly chosen LSTM units in the Future Predictor of a 1 layer (unconditioned) Composite Model
trained on moving MNIST digits. The vertical axis corresponds to different LSTM units. The horizontal axis is time. The model was
only trained to predict the next 10 frames, but here we let it run to predict the next 100 frames. Top: The dynamics has a periodic quality
which does not die out. Bottom : The pattern of activity, if the trained weights in the future predictor are replaced by random weights.
The dynamics quickly dies out.
3.3. Action Recognition on UCF-101/HMDB-51 The baseline for comparing these models is an identical
LSTM classifier but with randomly initialized weights. All
The aim of this set of experiments is to see if the features
classifiers used dropout regularization, where we dropped
learned by unsupervised learning can help improve perfor-
activations as they were communicated across layers but
mance on supervised tasks.
not through time within the same LSTM as proposed in
We trained a two layer Composite Model with 2048 hid- Zaremba et al. (2014). We emphasize that this is a very
den units with no conditioning on either decoders. The strong baseline and does significantly better than just using
model was trained on percepts extracted from 300 hours single frames. Using dropout was crucial in order to train
of YouTube data. The model was trained to autoencode good baseline models especially with very few training ex-
16 frames and predict the next 13 frames. We initialize an amples.
LSTM classifier with the weights learned by the encoder
Fig. 12 compares three models - single frame classifier
LSTM from this model. The classifier is shown in Fig. 11.
(logistic regression), baseline LSTM classifier and the
The output from each LSTM in the second layer goes into a
LSTM classifier initialized with weights from the Com-
softmax classifier that makes a prediction about the action
posite Model as the number of labelled videos per class is
being performed at each time step. Since only one action is
varied. Note that having one labelled video means having
being performed in each video in the datasets we consider,
many labelled 16 frame blocks. We can see that for the case
the target is the same at each time step. At test time, the
of very few training examples, unsupervised learning gives
predictions made at each time step are averaged. To get a
a substantial improvement. For example, for UCF-101, the
prediction for the entire video, we average the predictions
performance improves from 29.6% to 34.3% when train-
from all 16 frame blocks in the video with a stride of 8
ing on only one labelled video. As the size of the labelled
frames. Using a smaller stride did not improve results.
dataset grows, the improvement becomes smaller. Even for
Unsupervised Learning with LSTMs
Input Sequence - Ground Truth Future -

Input Reconstruction - Future Prediction -

Input Input Gates

Figure 8. Out-of-domain runs. Reconstruction and Future prediction for test sequences of one and three moving digits. The model was
trained on sequences of two moving digits.

Forget Gates Output Gates

(a) Inputs (b) Input Gates

(c) Forget Gates (d) Output Gates

Figure 9. Input features from a Composite Model trained on moving MNIST digits. In an LSTM, each input frame is connected to four
sets of units - the input, the input gate, forget gate and output gate. These figures show the top-200 features ordered by L2 norm of the
input features. The features in corresponding locations belong to the same LSTM unit.

(a) Input Reconstruction (b) Future Prediction

Figure 10. Output features from the two decoder LSTMs of a Composite Model trained on moving MNIST digits. These figures show
the top-200 features ordered by L2 norm.
Unsupervised Learning with LSTMs

y1 y2 ... yT Squared loss

Cross Entropy
Model on image
on MNIST
patches
Future Predictor 350.2 225.2
W (2) W (2) ... W (2) Composite Model 344.9 210.7
Conditional Future Predictor 343.5 221.3
Composite Model with
341.2 208.1
Conditional Future Predictor
(1) (1) (1)
W W ... W
Table 2. Future prediction results on MNIST and image patches.
All models use 2 layers of LSTMs.

we can use the error in predicting the future as a reasonable

v1 v2 ... vT measure of how good the model is doing. Besides, we can
use the performance on supervised tasks as a proxy for how
good the unsupervised model is doing. In this section, we
Figure 11. LSTM Classifier. present results from these two analyses.
UCF-101 UCF-101 HMDB-51 Future prediction results are summarized in Table 2. For
Model
RGB 1- frame flow RGB MNIST we compute the cross entropy of the predictions
Single Frame 72.2 72.2 40.1 with respect to the ground truth, both of which are 64 ×
LSTM classifier 74.5 74.3 42.8 64 patches. For natural image patches, we compute the
Composite LSTM
75.8 74.6 44.1 squared loss. We see that the Composite Model always
Model + Finetuning
does a better job of predicting the future compared to the
Table 1. Summary of Results on Action Recognition. Future Predictor. This indicates that having the autoen-
coder along with the future predictor to force the model
the full UCF-101 dataset we still get a considerable im- to remember more about the inputs actually helps predict
provement from 74.5% to 75.8%. On HMDB-51, the im- the future better. Next, we can compare each model with
provement is from 42.8% to 44.0% for the full dataset (70 its conditional variant. Here, we find that the conditional
videos per class) and 14.4% to 19.1% for one video per models perform better, as was also noted in Fig. 5.
class. Although, the improvement in classification by us-
Next, we compare the models using performance on a su-
ing unsupervised learning was not as big as we expected,
pervised task. Table 3 shows the performance on action
we still managed to yield an additional improvement over
recognition achieved by finetuning different unsupervised
a strong baseline. We discuss some avenues for improve-
learning models. Besides running the experiments on the
ments later.
full UCF-101 and HMDB-51 datasets, we also ran the ex-
We further ran similar experiments on the optical flow data periments on small subsets of these to better highlight the
extracted from the UCF-101 dataset. A Temporal stream case where we have very few training examples. We find
convolutional net, similar to the one proposed by Simonyan that all unsupervised models improve over the baseline
& Zisserman (2014b), was trained on single frame optical LSTM which is itself well-regularized by using dropout.
flows and stacks of 10 optical flows. This gave an accuracy The Autoencoder model seems to perform consistently bet-
of 72.2% and 77.5% respectively. LSTMs with 128 hid- ter than the Future Predictor. The Composite model which
den units trained on sequences of 4 frames improved the combines the two does better than either one alone. Con-
accuracy by 2.1% to 74.3% for the single frame case. Big- ditioning on the generated inputs does not seem to give a
ger LSTMs with more frames did not improve results. By clear advantage over not doing so. The Composite Model
pretraining the LSTM, we were able to further improve the with a conditional future predictor works the best, although
classification to 74.6%. Similarly for stacks of 10 frames its performance is almost same as that of the Composite
we improved slightly to 77.7%. These results are summa- Model.
rized in Table 1.
3.5. Comparison with Other Action Recognition
3.4. Comparison of Different Model Variants Benchmarks
The aim of this set of experiments is to compare the dif- Finally, we compare our models to the state-of-the-art ac-
ferent variants of the model proposed in this paper. Since tion recognition results. The performance is summarized in
it is always possible to get lower reconstruction error by Table 4. The table is divided into three sets. The first set
copying the inputs, we cannot use input reconstruction er- compares models that use only RGB data (single or mul-
ror as a measure of how good a model is doing. However, tiple frames). The second set compares models that use
Unsupervised Learning with LSTMs

80 50

45
70
40

Classification Accuracy
Classification Accuracy
60 35

30
50
25

40 20

Single Frame 15 Single Frame

30 LSTM LSTM
10
LSTM + Pretraining LSTM + Pretraining
20 5
1 2 4 10 20 50 100 1 2 4 8 16 32 64
Training Examples per class Training Examples per class

(a) UCF-101 RGB (b) HMDB-51 RGB

Figure 12. Effect of pretraining on action recognition with change in the size of the labelled training set. The error bars are over 10
different samples of training sets.
Method UCF-101 small UCF-101 HMDB-51 small HMDB-51
Baseline LSTM 63.7 74.5 25.3 42.8
Autoencoder 66.2 75.1 28.6 44.0
Future Predictor 64.9 74.9 27.3 43.1
Conditional Autoencoder 65.8 74.8 27.9 43.1
Conditional Future Predictor 65.1 74.9 27.4 43.4
Composite Model 67.0 75.8 29.1 44.1
Composite Model with Conditional Future Predictor 67.1 75.8 29.2 44.0

Table 3. Comparison of different unsupervised pretraining methods. UCF-101 small is a subset containing 10 videos per class. HMDB-
51 small contains 4 videos per class.
explicitly computed flow features only. Models in the third HMDB-
Method UCF-101
set use both. 51

On RGB data, our model performs at par with the best deep Spatial Convolutional Net (Simonyan &
73.0 40.5
Zisserman, 2014a)
models. It performs 3% better than the LRCN model that C3D (Tran et al., 2014) 72.3 -
also used LSTMs on top of convnet features1 . Our model C3D + fc6 (Tran et al., 2014) 76.4 -
performs better than C3D features that use a 3D convolu- LRCN (Donahue et al., 2014) 71.1 -
tional net. However, when the C3D features are concate- Composite LSTM Model 75.8 44.0
nated with fc6 percepts, they do slightly better than our Temporal Convolutional Net (Simonyan &
83.7 54.6
model. Zisserman, 2014a)
LRCN (Donahue et al., 2014) 77.0 -
The improvement for flow features over using a randomly Composite LSTM Model 77.7 -
initialized LSTM network is quite small. We believe this is
LRCN (Donahue et al., 2014) 82.9 -
atleast partly due to the fact that the flow percepts already Two-stream Convolutional Net (Simonyan &
capture a lot of the motion information that the LSTM 88.0 59.4
Zisserman, 2014a)
would otherwise discover. Multi-skip feature stacking (Lan et al., 2014) 89.1 65.1
Composite LSTM Model 84.3 -
When we combine predictions from the RGB and flow
models, we obtain 84.3 accuracy on UCF-101. We believe Table 4. Comparison with state-of-the-art action recognition
further improvements can be made by running the model models.
over different patch locations and mirroring the patches.
Also, our model can be applied deeper inside the convnet 4. Conclusions
instead of just at the top-level. That can potentially lead to We proposed models based on LSTMs that can learn good
further improvements. In this paper, we focus on showing video representations. We compared them and analyzed
that unsupervised training helps consistently across both their properties through visualizations. Moreover, we man-
datasets and across different sized training sets. aged to get an improvement on supervised tasks. The best
1 performing model was the Composite Model that combined
However, the improvement is only partially from unsuper-
vised learning, since we used a better convnet model. an autoencoder and a future predictor. Conditioning on
generated outputs did not have a significant impact on the
Unsupervised Learning with LSTMs

performance for supervised tasks, however it made the fu- Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T.
ture predictions look slightly better. The model was able to HMDB: a large video database for human motion recognition.
persistently generate motion well beyond the time scales it In Proceedings of the International Conference on Computer
Vision (ICCV), 2011.
was trained for. However, it lost the precise object features
rapidly after the training time scale. The features at the in- Lan, Zhen-Zhong, Lin, Ming, Li, Xuanchong, Hauptmann,
put and output layers were found to have some interesting Alexander G., and Raj, Bhiksha. Beyond gaussian pyramid:
Multi-skip feature stacking for action recognition. CoRR,
properties. abs/1411.6660, 2014.
To further get improvements for supervised tasks, we be- Le, Q. V., Zou, W., Yeung, S. Y., and Ng, A. Y. Learning hi-
lieve that the model can be extended by applying it convo- erarchical spatio-temporal features for action recognition with
lutionally across patches of the video and stacking multiple independent subspace analysis. In CVPR, 2011.
layers of such models. Applying this model in the lower Memisevic, Roland. Learning to relate images. IEEE Trans-
layers of a convolutional net could help extract motion in- actions on Pattern Analysis and Machine Intelligence, 35(8):
formation that would otherwise be lost across max-pooling 1829–1846, 2013.
layers. In our future work, we plan to build models based Memisevic, Roland and Hinton, Geoffrey E. Learning to represent
on these autoencoders from the bottom up instead of apply- spatial transformations with factored higher-order boltzmann
ing them only to percepts. machines. Neural Computation, 22(6):1473–1492, June 2010.

Acknowledgments Michalski, Vincent, Memisevic, Roland, and Konda, Kishore.

We acknowledge the support of Samsung, Raytheon BBN Modeling deep temporal dependencies with recurrent grammar
cells. In Advances in Neural Information Processing Systems
Technologies, and NVIDIA Corporation for the donation 27, pp. 1925–1933. Curran Associates, Inc., 2014.
of a GPU used for this research. The authors would like to
thank Geoffrey Hinton and Ilya Sutskever for helpful dis- Ranzato, Marc’Aurelio, Szlam, Arthur, Bruna, Joan, Mathieu,
Michaël, Collobert, Ronan, and Chopra, Sumit. Video (lan-
cussions and comments.
guage) modeling: a baseline for generative models of natural
References videos. CoRR, abs/1412.6604, 2014.
Cho, Kyunghyun, van Merrienboer, Bart, Gülçehre, Çaglar, Bah- Simonyan, K. and Zisserman, A. Two-stream convolutional net-
danau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Ben- works for action recognition in videos. In Advances in Neural
gio, Yoshua. Learning phrase representations using RNN Information Processing Systems, 2014a.
encoder-decoder for statistical machine translation. In Pro-
ceedings of the 2014 Conference on Empirical Methods in Simonyan, K. and Zisserman, A. Very deep convolu-
Natural Language Processing, EMNLP 2014, pp. 1724–1734, tional networks for large-scale image recognition. CoRR,
2014. abs/1409.1556, 2014b.
Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Soomro, k., Roshan Zamir, A., and Shah, M. UCF101: A dataset
Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, of 101 human actions classes from videos in the wild. In
and Darrell, Trevor. Long-term recurrent convolutional CRCV-TR-12-01, 2012.
networks for visual recognition and description. CoRR,
Susskind, J., Memisevic, R., Hinton, G., and Pollefeys, M. Mod-
abs/1411.4389, 2014.
eling the joint density of two images under a variety of trans-
Graves, Alex. Generating sequences with recurrent neural net- formations. In Proceedings of IEEE Conference on Computer
works. CoRR, abs/1308.0850, 2013. Vision and Pattern Recognition, 2011.
Graves, Alex and Jaitly, Navdeep. Towards end-to-end speech Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. V. Sequence to
recognition with recurrent neural networks. In Proceedings sequence learning with neural networks. In Advances in Neural
of the 31st International Conference on Machine Learning Information Processing Systems 27, pp. 3104–3112. 2014.
(ICML-14), pp. 1764–1772, 2014.
Tran, Du, Bourdev, Lubomir D., Fergus, Rob, Torresani, Lorenzo,
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term and Paluri, Manohar. C3D: generic features for video analysis.
memory. Neural Computation, 9(8):1735–1780, 1997. CoRR, abs/1412.0767, 2014.
Hurri, Jarmo and Hyvärinen, Aapo. Simple-cell-like receptive van Hateren, J. H. and Ruderman, D. L. Independent component
fields maximize temporal coherence in natural video. Neural analysis of natural image sequences yields spatio-temporal fil-
Computation, 15(3):663–691, 2003. ters similar to simple cells in primary visual cortex. Proceed-
ings. Biological sciences / The Royal Society, 265(1412):2315–
Ji, Shuiwang, Xu, Wei, Yang, Ming, and Yu, Kai. 3d convolu- 2320, 1998.
tional neural networks for human action recognition. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 35 Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Du-
(1):221–231, Jan 2013. mitru. Show and tell: A neural image caption generator. CoRR,
abs/1411.4555, 2014.
Karpathy, Andrej, Toderici, George, Shetty, Sanketh, Leung,
Thomas, Sukthankar, Rahul, and Fei-Fei, Li. Large-scale video Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Re-
classification with convolutional neural networks. In CVPR, current neural network regularization. CoRR, abs/1409.2329,
2014. 2014.

lec-10
No ratings yet
lec-10
37 pages
RNN-StannfordBased
No ratings yet
RNN-StannfordBased
102 pages
Sequence Models231205
No ratings yet
Sequence Models231205
72 pages
11.RNN and Transformers
No ratings yet
11.RNN and Transformers
100 pages
3128/submission 3128
No ratings yet
3128/submission 3128
10 pages
CSE 4237 SoftCom Solutions
No ratings yet
CSE 4237 SoftCom Solutions
115 pages
chapter 2
No ratings yet
chapter 2
68 pages
Video Transformer Network
No ratings yet
Video Transformer Network
11 pages
Research on Learning Representations in Computer Vision
No ratings yet
Research on Learning Representations in Computer Vision
52 pages
lecture 11
No ratings yet
lecture 11
57 pages
RNNs and LSTMs
No ratings yet
RNNs and LSTMs
41 pages
06-DL-Deep Learning For Text Data (LSTM Seq2Seq Models)
No ratings yet
06-DL-Deep Learning For Text Data (LSTM Seq2Seq Models)
44 pages
Cs224n 2025 Lecture06 Fancy Rnn
No ratings yet
Cs224n 2025 Lecture06 Fancy Rnn
57 pages
RNN LSTM
No ratings yet
RNN LSTM
37 pages
9 RNN LSTM Gru
No ratings yet
9 RNN LSTM Gru
91 pages
A_review_on_the_Long_short_term_memory_model
No ratings yet
A_review_on_the_Long_short_term_memory_model
34 pages
Exploring LSTMs
No ratings yet
Exploring LSTMs
35 pages
LSTM
No ratings yet
LSTM
27 pages
Elative Representations Enable Zero Shot Latent Space Communication
No ratings yet
Elative Representations Enable Zero Shot Latent Space Communication
20 pages
2003 - Probabilistic Future Prediction For Video Scene Understanding
No ratings yet
2003 - Probabilistic Future Prediction For Video Scene Understanding
21 pages
Google Certified Professional Cloud Architect
No ratings yet
Google Certified Professional Cloud Architect
464 pages
11-rnn
No ratings yet
11-rnn
32 pages
R3M: A Universal Visual Representation For Robot Manipulation
No ratings yet
R3M: A Universal Visual Representation For Robot Manipulation
18 pages
2501.05453v1
No ratings yet
2501.05453v1
19 pages
Efﬁcient Training of Visual Transformers with Small Datasets_Liu et al_
No ratings yet
Efﬁcient Training of Visual Transformers with Small Datasets_Liu et al_
13 pages
Deep+Learning+Approaches+to+Predict+Future+Frames+in+Videos
No ratings yet
Deep+Learning+Approaches+to+Predict+Future+Frames+in+Videos
17 pages
2021 NeurIPS VAAT Akbari, Yuan, Qian, Chuang, Chang, Cui, Gong
No ratings yet
2021 NeurIPS VAAT Akbari, Yuan, Qian, Chuang, Chang, Cui, Gong
16 pages
Time2Vec Embedding On A Seq2Seq Bi-Directional LSTM Network For Pedestrian Trajectory
No ratings yet
Time2Vec Embedding On A Seq2Seq Bi-Directional LSTM Network For Pedestrian Trajectory
12 pages
Neural Networks
No ratings yet
Neural Networks
22 pages
Transformers For One-Shot Visual Imitation: For Code and Project Video Please Check Our Website
No ratings yet
Transformers For One-Shot Visual Imitation: For Code and Project Video Please Check Our Website
14 pages
Lecture 2.3.5lstmencoders
No ratings yet
Lecture 2.3.5lstmencoders
9 pages
Understanding LSTM Networks
No ratings yet
Understanding LSTM Networks
10 pages
Gao SimVP Simpler Yet Better Video Prediction CVPR 2022 Paper
No ratings yet
Gao SimVP Simpler Yet Better Video Prediction CVPR 2022 Paper
11 pages
2020 - Zhang-Liang-Li-Wang-Wu - Research On Stock Prediction Model Based On Deep Learning - Journal of Physics Conference Series
No ratings yet
2020 - Zhang-Liang-Li-Wang-Wu - Research On Stock Prediction Model Based On Deep Learning - Journal of Physics Conference Series
8 pages
Video Representation Learning by Dense Predictive Coding 1909.04656
No ratings yet
Video Representation Learning by Dense Predictive Coding 1909.04656
13 pages
NIPS-2015-semi-supervised-sequence-learning-Paper
No ratings yet
NIPS-2015-semi-supervised-sequence-learning-Paper
9 pages
s42979-022-01498-y
No ratings yet
s42979-022-01498-y
16 pages
Comparing Hidden Markov Models and Long Short Term Memory Neural Networks For Learning Action Representations
No ratings yet
Comparing Hidden Markov Models and Long Short Term Memory Neural Networks For Learning Action Representations
12 pages
paper4
No ratings yet
paper4
12 pages
Unsupervised Learning of Video Representations using LSTMs
No ratings yet
Unsupervised Learning of Video Representations using LSTMs
9 pages
dis6-sol
No ratings yet
dis6-sol
6 pages
Full Resolution Image Compression With Recurrent Neural Networks
No ratings yet
Full Resolution Image Compression With Recurrent Neural Networks
9 pages
Mahasseni Unsupervised Video Summarization CVPR 2017 Paperdfdffvfdfgdgf
No ratings yet
Mahasseni Unsupervised Video Summarization CVPR 2017 Paperdfdffvfdfgdgf
10 pages
paper3
No ratings yet
paper3
9 pages
paper2
No ratings yet
paper2
7 pages
4.1 - Unsupervised Visual Representation Learning by Context Prediction
No ratings yet
4.1 - Unsupervised Visual Representation Learning by Context Prediction
10 pages
Boundary Detector Encoder and Decoder With Soft Attention for Video Captioning
No ratings yet
Boundary Detector Encoder and Decoder With Soft Attention for Video Captioning
11 pages
Practice Question DL Unit-3
No ratings yet
Practice Question DL Unit-3
3 pages
Actnetformer: Transformer-Resnet Hybrid Method For Semi-Supervised Action Recognition in Videos
No ratings yet
Actnetformer: Transformer-Resnet Hybrid Method For Semi-Supervised Action Recognition in Videos
22 pages
End-To-End Learning of Driving Models From Large-Scale Video Datasets
No ratings yet
End-To-End Learning of Driving Models From Large-Scale Video Datasets
9 pages
Electrical Inspector Manual
100% (4)
Electrical Inspector Manual
178 pages
Deep Reinforcement Learning For Unsupervised Video Summarization WithDiversity-Representativeness Reward
No ratings yet
Deep Reinforcement Learning For Unsupervised Video Summarization WithDiversity-Representativeness Reward
9 pages
Image Captioning Using CNN & RNN
No ratings yet
Image Captioning Using CNN & RNN
4 pages
An Introduction To Transformers
No ratings yet
An Introduction To Transformers
10 pages
Summary of Progress
No ratings yet
Summary of Progress
9 pages
Realtime Sign Language Gesture Word Recognition From Video Seque 2018
No ratings yet
Realtime Sign Language Gesture Word Recognition From Video Seque 2018
10 pages
Unsup Video
No ratings yet
Unsup Video
10 pages
Character-Aware Neural Language Models
No ratings yet
Character-Aware Neural Language Models
9 pages
Research Notes
No ratings yet
Research Notes
9 pages
Unsupervised Learning of Video Representations Using Lstms
No ratings yet
Unsupervised Learning of Video Representations Using Lstms
12 pages
Understanding LSTM Networks
No ratings yet
Understanding LSTM Networks
7 pages
Albert-László Barabási & Eric Bonabeau, Scale-Free Networks
No ratings yet
Albert-László Barabási & Eric Bonabeau, Scale-Free Networks
10 pages
Method Statement For Testing & Commissioning of Light Fittings & Fixtures
100% (1)
Method Statement For Testing & Commissioning of Light Fittings & Fixtures
3 pages
Projects - JFS
No ratings yet
Projects - JFS
53 pages
DADM NOTES and Cheat Sheet
No ratings yet
DADM NOTES and Cheat Sheet
11 pages
COM0P
No ratings yet
COM0P
10 pages
ICT -1 unit 5 python
No ratings yet
ICT -1 unit 5 python
2 pages
Template Manual AREVA P441 P442 P444 E33 ENU TU2.20 V1.000
No ratings yet
Template Manual AREVA P441 P442 P444 E33 ENU TU2.20 V1.000
17 pages
2022-09-27 22点的log 0
No ratings yet
2022-09-27 22点的log 0
15 pages
SD 4.1 Portfolio Guide
No ratings yet
SD 4.1 Portfolio Guide
123 pages
243v5qhaba 01 Dfu Aen
No ratings yet
243v5qhaba 01 Dfu Aen
25 pages
Supply Chain & CRM-Linkage: Short Questions
No ratings yet
Supply Chain & CRM-Linkage: Short Questions
17 pages
p4c Matlab
No ratings yet
p4c Matlab
269 pages
Ch. 6 Lecture Slides For Chenming Hu Book: Modern Semiconductor Devices For ICs
100% (2)
Ch. 6 Lecture Slides For Chenming Hu Book: Modern Semiconductor Devices For ICs
70 pages
Noc18 cs48 Assignment3
100% (1)
Noc18 cs48 Assignment3
4 pages
Staff Selection Commission, Southern Region, Chennai
No ratings yet
Staff Selection Commission, Southern Region, Chennai
5 pages
Where Emp - Deptno Dept - Deptno and Sal 12 30000 and JOB 'CLERK'
100% (1)
Where Emp - Deptno Dept - Deptno and Sal 12 30000 and JOB 'CLERK'
8 pages
Fire Xc166
No ratings yet
Fire Xc166
69 pages
Data GER: High Performance, Easy Handling
No ratings yet
Data GER: High Performance, Easy Handling
8 pages
A Roadmap For BIM Adoption and Implementation in Developing Countries The Pakistan Case
No ratings yet
A Roadmap For BIM Adoption and Implementation in Developing Countries The Pakistan Case
21 pages
Turbo Macro Pro Editor Commands
No ratings yet
Turbo Macro Pro Editor Commands
4 pages
Amp Report
No ratings yet
Amp Report
7 pages
Alton Madyara CV
No ratings yet
Alton Madyara CV
4 pages
Vba - Format Slice of Excel Pie Chart Based On Horizontal Category Axis - Stack Overflow
No ratings yet
Vba - Format Slice of Excel Pie Chart Based On Horizontal Category Axis - Stack Overflow
3 pages
ISO 27001 ISMS Awareness Course Outline
No ratings yet
ISO 27001 ISMS Awareness Course Outline
2 pages
Recommendation of Agricultural Crop Based On Productivity and Season Using Machine Learning
No ratings yet
Recommendation of Agricultural Crop Based On Productivity and Season Using Machine Learning
9 pages
Introduction To Spatial Analysis: Module Organization
No ratings yet
Introduction To Spatial Analysis: Module Organization
9 pages
Counter Path X-Lite SIP Compatibility Report (Ver4.0)
No ratings yet
Counter Path X-Lite SIP Compatibility Report (Ver4.0)
6 pages
Question Paper For JTO Phase-I Data Communication: Fill in The Blanks/ Short Answer questions/True/False Type
No ratings yet
Question Paper For JTO Phase-I Data Communication: Fill in The Blanks/ Short Answer questions/True/False Type
3 pages
Deep Learning: Fundamentals and Applications
From Everand
Deep Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet