
Unsupervised Learning of Video Representations using LSTMs

Nitish Srivastava                                    nitish@cs.toronto.edu
Elman Mansimov                                       emansim@cs.toronto.edu
Ruslan Salakhutdinov                                 rsalakhu@cs.toronto.edu
University of Toronto, 6 King's College Road, Toronto, ON M5S 3G4, Canada

arXiv:1502.04681v3 [cs.LG] 4 Jan 2016

Abstract

We use multilayer Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences – patches of image pixels and high-level representations ("percepts") of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We try to visualize and interpret the learned features. We stress test the model by running it on longer time scales and on out-of-domain data. We further evaluate the representations by finetuning them for a supervised learning problem – human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.

1. Introduction

Understanding temporal sequences is important for solving many problems in AI. Recently, recurrent neural networks using the Long Short Term Memory (LSTM) architecture (Hochreiter & Schmidhuber, 1997) have been used successfully to perform various supervised sequence learning tasks, such as speech recognition (Graves & Jaitly, 2014), machine translation (Sutskever et al., 2014; Cho et al., 2014), and caption generation for images (Vinyals et al., 2014). They have also been applied to videos for recognizing actions and generating natural language descriptions (Donahue et al., 2014). A general sequence to sequence learning framework was described by Sutskever et al. (2014) in which a recurrent network is used to encode a sequence into a fixed length representation, and then another recurrent network is used to decode a sequence out of that representation. In this work, we apply and extend this framework to learn representations of sequences of images. We choose to work in the unsupervised setting where we only have access to a dataset of unlabelled videos.

Videos are an abundant and rich source of visual information and can be seen as a window into the physics of the world we live in, showing us examples of what constitutes objects, how objects move against backgrounds, what happens when cameras move and how things get occluded. Being able to learn a representation that disentangles these factors would help in making intelligent machines that can understand and act in their environment. Additionally, learning good video representations is essential for a number of useful tasks, such as recognizing actions and gestures.

1.1. Why Unsupervised Learning?

Supervised learning has been extremely successful in learning good visual representations that not only produce good results at the task they are trained for, but also transfer well to other tasks and datasets. Therefore, it is natural to extend the same approach to learning video representations. This has led to research in 3D convolutional nets (Ji et al., 2013; Tran et al., 2014), different temporal fusion strategies (Karpathy et al., 2014) and exploring different ways of presenting visual information to convolutional nets (Simonyan & Zisserman, 2014a). However, videos are much higher dimensional entities compared to single images. Therefore, it becomes increasingly difficult to do credit assignment and learn long range structure, unless we collect much more labelled data or do a lot of feature engineering (for example, computing the right kinds of flow features) to keep the dimensionality low. The costly work of collecting more labelled data and the tedious work of doing more clever engineering can go a long way in solving particular problems, but this is ultimately unsatisfying as a machine learning solution. This highlights the need for using unsupervised learning to find and represent structure in videos. Moreover, videos have a lot of structure in them (spatial and temporal regularities) which makes them particularly well suited as a domain for building unsupervised learning models.
1.2. Our Approach

When designing any unsupervised learning model, it is crucial to have the right inductive biases and choose the right objective function so that the learning signal points the model towards learning useful features. In this paper, we use the LSTM Encoder-Decoder framework to learn video representations. The key inductive bias here is that the same operation must be applied at each time step to propagate information to the next step. This enforces the fact that the physics of the world remains the same, irrespective of input. The same physics acting on any state, at any time, must produce the next state. Our model works as follows. The Encoder LSTM runs through a sequence of frames to come up with a representation. This representation is then decoded through another LSTM to produce a target sequence. We consider different choices of the target sequence. One choice is to predict the same sequence as the input. The motivation is similar to that of autoencoders – we wish to capture all that is needed to reproduce the input but at the same time go through the inductive biases imposed by the model. Another option is to predict the future frames. Here the motivation is to learn a representation that extracts all that is needed to extrapolate the motion and appearance beyond what has been observed. These two natural choices can also be combined. In this case, there are two decoder LSTMs – one that decodes the representation into the input sequence and another that decodes the same representation to predict the future.

The inputs to the model can, in principle, be any representation of individual video frames. However, for the purposes of this work, we limit our attention to two kinds of inputs. The first is image patches. For this we use natural image patches as well as a dataset of moving MNIST digits. The second is high-level "percepts" extracted by applying a convolutional net trained on ImageNet. These percepts are the states of last (and/or second-to-last) layers of rectified linear hidden states from a convolutional neural net model.

In order to evaluate the learned representations we qualitatively analyze the reconstructions and predictions made by the model. For a more quantitative evaluation, we use these LSTMs as initializations for the supervised task of action recognition. If the unsupervised learning model comes up with useful representations then the classifier should be able to perform better, especially when there are only a few labelled examples. We find that this is indeed the case.

1.3. Related Work

The first approaches to learning representations of videos in an unsupervised way were based on ICA (van Hateren & Ruderman, 1998; Hurri & Hyvärinen, 2003). Le et al. (2011) approached this problem using multiple layers of Independent Subspace Analysis modules. Generative models for understanding transformations between pairs of consecutive images are also well studied (Memisevic, 2013; Memisevic & Hinton, 2010; Susskind et al., 2011). This work was extended recently by Michalski et al. (2014) to model longer sequences.

Recently, Ranzato et al. (2014) proposed a generative model for videos. The model uses a recurrent neural network to predict the next frame or interpolate between frames. In this work, the authors highlight the importance of choosing the right loss function. It is argued that squared loss in input space is not the right objective because it does not respond well to small distortions in input space. The proposed solution is to quantize image patches into a large dictionary and train the model to predict the identity of the target patch. This does solve some of the problems of squared loss but it introduces an arbitrary dictionary size into the picture and altogether removes the idea of patches being similar or dissimilar to one another. Designing an appropriate loss function that respects our notion of visual similarity is a very hard problem (in a sense, almost as hard as the modeling problem we want to solve in the first place). Therefore, in this paper, we use the simple squared loss objective function as a starting point and focus on designing an encoder-decoder RNN architecture that can be used with any loss function.

2. Model Description

In this section, we describe several variants of our LSTM Encoder-Decoder model. The basic unit of our network is the LSTM cell block. Our implementation of LSTMs follows closely the one discussed by Graves (2013).

2.1. Long Short Term Memory

In this section we briefly describe the LSTM unit which is the basic building block of our model. The unit is shown in Fig. 1 (reproduced from Graves (2013)).

Figure 1. LSTM unit.

Each LSTM unit has a cell which has a state c_t at time t. This cell can be thought of as a memory unit. Access to this memory unit for reading or modifying it is controlled through sigmoidal gates – input gate i_t, forget gate f_t and output gate o_t.
The LSTM unit operates as follows. At each time step it receives inputs from two external sources at each of the four terminals (the three gates and the input). The first source is the current frame x_t. The second source is the previous hidden states of all LSTM units in the same layer h_{t-1}. Additionally, each gate has an internal source, the cell state c_{t-1} of its cell block. The links between a cell and its own gates are called peephole connections. The inputs coming from different sources get added up, along with a bias. The gates are activated by passing their total input through the logistic function. The total input at the input terminal is passed through the tanh non-linearity. The resulting activation is multiplied by the activation of the input gate. This is then added to the cell state after multiplying the cell state by the forget gate's activation f_t. The final output from the LSTM unit h_t is computed by multiplying the output gate's activation o_t with the updated cell state passed through a tanh non-linearity. These updates are summarized for a layer of LSTM units as follows

    i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i),
    f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f),
    c_t = f_t c_{t-1} + i_t tanh(W_xc x_t + W_hc h_{t-1} + b_c),
    o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o),
    h_t = o_t tanh(c_t).

Note that all W_c• matrices are diagonal, whereas the rest are dense. The key advantage of using an LSTM unit over a traditional neuron in an RNN is that the cell state in an LSTM unit sums activities over time. Since derivatives distribute over sums, the error derivatives don't vanish quickly as they get sent back into time. This makes it easy to do credit assignment over long sequences and discover long-range features.
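For concreteness, a minimal NumPy sketch of a single step of this peephole LSTM is given below. It is an illustrative re-implementation of the update equations above, not the authors' released code; the parameter names simply mirror the notation, and the diagonal cell-to-gate weights are stored as vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the peephole LSTM described above.

    p is a dict of parameters: dense matrices W_xi, W_hi, W_xf, W_hf,
    W_xc, W_hc, W_xo, W_ho; diagonal peephole weights stored as vectors
    w_ci, w_cf, w_co; and bias vectors b_i, b_f, b_c, b_o."""
    i_t = sigmoid(p['W_xi'] @ x_t + p['W_hi'] @ h_prev + p['w_ci'] * c_prev + p['b_i'])
    f_t = sigmoid(p['W_xf'] @ x_t + p['W_hf'] @ h_prev + p['w_cf'] * c_prev + p['b_f'])
    c_t = f_t * c_prev + i_t * np.tanh(p['W_xc'] @ x_t + p['W_hc'] @ h_prev + p['b_c'])
    o_t = sigmoid(p['W_xo'] @ x_t + p['W_ho'] @ h_prev + p['w_co'] * c_t + p['b_o'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```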
2.2. LSTM Autoencoder Model

In this section, we describe a model that uses Recurrent Neural Nets (RNNs) made of LSTM units to do unsupervised learning. The model consists of two RNNs – the encoder LSTM and the decoder LSTM as shown in Fig. 2. The input to the model is a sequence of vectors (image patches or features). The encoder LSTM reads in this sequence. After the last input has been read, the decoder LSTM takes over and outputs a prediction for the target sequence. The target sequence is the same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by looking at low range correlations. This is also inspired by how lists are represented in LISP. The encoder can be seen as creating a list by applying the cons function on the previously constructed list and the new input. The decoder essentially unrolls this list, with the hidden to output weights extracting the element at the top of the list (car function) and the hidden to hidden weights extracting the rest of the list (cdr function). Therefore, the first element out is the last element in.

Figure 2. LSTM Autoencoder Model.

The decoder can be of two kinds – conditional or unconditioned. A conditional decoder receives the last generated output frame as input, i.e., the dotted input in Fig. 2 is present. An unconditioned decoder does not receive that input. This is discussed in more detail in Sec. 2.4. Fig. 2 shows a single layer LSTM Autoencoder. The architecture can be extended to multiple layers by stacking LSTMs on top of each other.

Why should this learn good features?
The state of the encoder LSTM after the last input has been read is the representation of the input video. The decoder LSTM is being asked to reconstruct back the input sequence from this representation. In order to do so, the representation must retain information about the appearance of the objects and the background as well as the motion contained in the video. However, an important question for any autoencoder-style model is what prevents it from learning an identity mapping and effectively copying the input to the output. In that case all the information about the input would still be present but the representation will be no better than the input. There are two factors that control this behaviour. First, the fact that there are only a fixed number of hidden units makes it unlikely that the model can learn trivial mappings for arbitrary length input sequences. Second, the same LSTM operation is used to decode the representation recursively. This means that the same dynamics must be applied on the representation at any stage of decoding. This further prevents the model from learning an identity mapping.
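As a small illustration of how the training pairs for this autoencoder are arranged, the sketch below builds encoder inputs and reversed reconstruction targets from a batch of clips. It is a hypothetical data-preparation helper, assuming clips are stored as a NumPy array of shape (batch, time, features); the actual pipeline in the released code may differ.

```python
import numpy as np

def autoencoder_targets(clips):
    """Build (encoder_input, reconstruction_target) pairs.

    clips: array of shape (batch, T, feature_dim), e.g. flattened image
    patches or convnet percepts. The reconstruction target is the input
    sequence in reverse order, so the first frame the decoder must emit
    is the last frame the encoder saw."""
    encoder_input = clips
    reconstruction_target = clips[:, ::-1, :]
    return encoder_input, reconstruction_target

# Example: a batch of 8 clips, 10 frames each, 64*64 = 4096 pixels per frame.
batch = np.random.rand(8, 10, 4096).astype(np.float32)
enc_in, rec_target = autoencoder_targets(batch)
```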
2.3. LSTM Future Predictor Model

Another natural unsupervised learning task for sequences is predicting the future. This is the approach used in language models for modeling sequences of words. The design of the Future Predictor Model is the same as that of the Autoencoder Model, except that the decoder LSTM in this case predicts frames of the video that come after the input sequence (Fig. 3). Ranzato et al. (2014) use a similar model but predict only the next frame at each time step. This model, on the other hand, predicts a long sequence into the future. Here again we can consider two variants of the decoder – conditional and unconditioned.

Figure 3. LSTM Future Predictor Model.

Why should this learn good features?
In order to predict the next few frames correctly, the model needs information about which objects and background are present and how they are moving so that the motion can be extrapolated. The hidden state coming out from the encoder will try to capture this information. Therefore, this state can be seen as a representation of the input sequence.

2.4. Conditional Decoder

For each of these two models, we can consider two possibilities - one in which the decoder LSTM is conditioned on the last generated frame and the other in which it is not. In the experimental section, we explore these choices quantitatively. Here we briefly discuss arguments for and against a conditional decoder. A strong argument in favour of using a conditional decoder is that it allows the decoder to model multiple modes in the target sequence distribution. Without that, we would end up averaging the multiple modes in the low-level input space. However, this is an issue only if we expect multiple modes in the target sequence distribution. For the LSTM Autoencoder, there is only one correct target and hence a unimodal target distribution. But for the LSTM Future Predictor there is a possibility of multiple targets given an input because even if we assume a deterministic universe, everything needed to predict the future will not necessarily be observed in the input.

There is also an argument against using a conditional decoder from the optimization point-of-view. There are strong short-range correlations in video data, for example, most of the content of a frame is the same as the previous one. If the decoder was given access to the last few frames while generating a particular frame at training time, it would find it easy to pick up on these correlations. There would only be a very small gradient that tries to fix up the extremely subtle errors that require long term knowledge about the input sequence. In an unconditioned decoder, this input is removed and the model is forced to look for information deep inside the encoder.

2.5. A Composite Model

The two tasks – reconstructing the input and predicting the future – can be combined to create a composite model as shown in Fig. 4. Here the encoder LSTM is asked to come up with a state from which we can both predict the next few frames as well as reconstruct the input.

Figure 4. The Composite Model: The LSTM predicts the future as well as the input sequence.

This composite model tries to overcome the shortcomings that each model suffers on its own. A high-capacity autoencoder would suffer from the tendency to learn trivial representations that just memorize the inputs. However, this memorization is not useful at all for predicting the future. Therefore, the composite model cannot just memorize information. On the other hand, the future predictor suffers from the tendency to store information only about the last few frames since those are most important for predicting the future, i.e., in order to predict v_t, the frames {v_{t-1}, ..., v_{t-k}} are much more important than v_0, for some small value of k. Therefore the representation at the end of the encoder will have forgotten about a large part of the input. But if we ask the model to also predict all of the input sequence, then it cannot just pay attention to the last few frames.
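To make the structure of the Composite Model concrete, here is a minimal PyTorch-style sketch of an encoder whose final state seeds two unconditioned decoders, trained with the squared loss used later for image patches. It is an illustrative sketch under those assumptions, not the authors' implementation, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class CompositeModel(nn.Module):
    """Encoder LSTM whose final state is copied into two decoder LSTMs:
    one reconstructs the (reversed) input, the other predicts the future."""
    def __init__(self, input_dim=4096, hidden_dim=2048):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.dec_recon = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.dec_future = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.out_recon = nn.Linear(hidden_dim, input_dim)
        self.out_future = nn.Linear(hidden_dim, input_dim)

    def forward(self, x, n_recon, n_future):
        # x: (batch, T, input_dim). Encode the whole input sequence.
        _, (h, c) = self.encoder(x)
        # Unconditioned decoders: feed zeros at every step, so all the
        # information has to come from the copied encoder state.
        zeros_r = x.new_zeros(x.size(0), n_recon, x.size(2))
        zeros_f = x.new_zeros(x.size(0), n_future, x.size(2))
        r_states, _ = self.dec_recon(zeros_r, (h, c))
        f_states, _ = self.dec_future(zeros_f, (h, c))
        return self.out_recon(r_states), self.out_future(f_states)

def composite_loss(model, past, future):
    """Squared-error objective on both branches (image-patch case)."""
    recon, pred = model(past, past.size(1), future.size(1))
    target_recon = torch.flip(past, dims=[1])   # reconstruct the input in reverse
    return ((recon - target_recon) ** 2).mean() + ((pred - future) ** 2).mean()
```

A conditional variant would feed the previously generated frame to each decoder instead of zeros.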
3. Experiments

We design experiments to accomplish the following objectives:

• Get a qualitative understanding of what the LSTM learns to do.

• Measure the benefit of initializing networks for supervised learning tasks with the weights found by unsupervised learning, especially with very few training examples.

• Compare the different proposed models - Autoencoder, Future Predictor and Composite models and their conditional variants.

• Compare with state-of-the-art action recognition benchmarks.

3.1. Datasets

We use the UCF-101 and HMDB-51 datasets for supervised tasks. The UCF-101 dataset (Soomro et al., 2012) contains 13,320 videos with an average length of 6.2 seconds belonging to 101 different action categories. The dataset has 3 standard train/test splits with the training set containing around 9,500 videos in each split (the rest are test). The HMDB-51 dataset (Kuehne et al., 2011) contains 5100 videos belonging to 51 different action categories. Mean length of the videos is 3.2 seconds. This also has 3 train/test splits with 3570 videos in the training set and the rest in test.

To train the unsupervised models, we used a subset of the Sports-1M dataset (Karpathy et al., 2014), which contains 1 million YouTube clips. Even though this dataset is labelled for actions, we did not do any supervised experiments on it because of logistical constraints with working with such a huge dataset. We instead collected 300 hours of video by randomly sampling 10 second clips from the dataset. It would be possible to collect better samples if, instead of choosing randomly, we extracted videos where a lot of motion is happening and where there are no shot boundaries. However, we did not do so in the spirit of unsupervised learning, and because we did not want to introduce any unnatural bias in the samples. We also used the supervised datasets (UCF-101 and HMDB-51) for unsupervised training. However, we found that using them did not give any significant advantage over just using the YouTube videos.

We extracted percepts using the convolutional neural net model of Simonyan & Zisserman (2014b). The videos have a resolution of 240 × 320 and were sampled at almost 30 frames per second. We took the central 224 × 224 patch from each frame and ran it through the convnet. This gave us the RGB percepts. Additionally, for UCF-101, we computed flow percepts by extracting flows using the Brox method and training the temporal stream convolutional network as described by Simonyan & Zisserman (2014a). We found that the fc6 features worked better than fc7 for single frame classification using both RGB and flow percepts. Therefore, we used the 4096-dimensional fc6 layer as the input representation of our data. Besides these percepts, we also trained the proposed models on 32 × 32 patches of pixels.
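As an illustration of the preprocessing described above, the sketch below center-crops a 240 × 320 frame to 224 × 224 before it would be fed to the convnet. The crop helper is hypothetical (the paper does not publish this exact code), and the convnet forward pass is left abstract.

```python
import numpy as np

def center_crop(frame, size=224):
    """Take the central size x size patch of an H x W x 3 frame."""
    h, w = frame.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return frame[top:top + size, left:left + size]

# Example: a dummy 240 x 320 RGB frame; the cropped patch would then be
# passed through the VGG-style convnet and the 4096-d fc6 activations kept.
frame = np.zeros((240, 320, 3), dtype=np.uint8)
patch = center_crop(frame)          # shape (224, 224, 3)
```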
All models were trained using backprop on a single NVIDIA Titan GPU. A two layer 2048 unit Composite Model that predicts 13 frames and reconstructs 16 frames took 18-20 hours to converge on 300 hours of percepts. We initialized weights by sampling from a uniform distribution whose scale was set to 1/sqrt(fan-in). Biases at all the gates were initialized to zero. Peephole connections were initialized to zero. The supervised classifiers trained on 16 frames took 5-15 minutes to converge. The code can be found at https://github.com/emansim/unsupervised-videos.
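The initialization scheme mentioned above can be written in a few lines; this is a hedged sketch of that recipe (uniform weights scaled by 1/sqrt(fan-in), zero biases and zero peephole weights), not an excerpt from the released code.

```python
import numpy as np

def init_dense(fan_in, fan_out, rng=np.random):
    """Uniform initialization with scale 1/sqrt(fan-in)."""
    scale = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-scale, scale, size=(fan_out, fan_in))

def init_lstm_layer(input_dim, hidden_dim):
    params = {}
    for gate in ('i', 'f', 'c', 'o'):
        params['W_x' + gate] = init_dense(input_dim, hidden_dim)
        params['W_h' + gate] = init_dense(hidden_dim, hidden_dim)
        params['b_' + gate] = np.zeros(hidden_dim)        # gate biases start at zero
    for peep in ('w_ci', 'w_cf', 'w_co'):
        params[peep] = np.zeros(hidden_dim)               # peepholes start at zero
    return params
```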
3.2. Visualization and Qualitative Analysis

The aim of this set of experiments is to visualize the properties of the proposed models.

Experiments on MNIST
We first trained our models on a dataset of moving MNIST digits. In this dataset, each video was 20 frames long and consisted of two digits moving inside a 64 × 64 patch. The digits were chosen randomly from the training set and placed initially at random locations inside the patch. Each digit was assigned a velocity whose direction was chosen uniformly at random on a unit circle and whose magnitude was also chosen uniformly at random over a fixed range. The digits bounced off the edges of the 64 × 64 frame and overlapped if they were at the same location. The reason for working with this dataset is that it is infinite in size and can be generated quickly on the fly. This makes it possible to explore the model without expensive disk accesses or overfitting issues. It also has interesting behaviours due to occlusions and the dynamics of bouncing off the walls.
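The moving MNIST dataset described above is easy to regenerate; the following is a hedged sketch of such a generator (random placement, velocity direction drawn uniformly on the unit circle, bouncing off the walls, pixel-wise overlap), written against a hypothetical mnist_digits array of 28 × 28 images rather than the authors' exact script.

```python
import numpy as np

def make_moving_mnist(mnist_digits, n_frames=20, size=64, n_digits=2,
                      speed_range=(2.0, 5.0), rng=np.random):
    """Generate one video of bouncing MNIST digits, shape (n_frames, size, size)."""
    video = np.zeros((n_frames, size, size), dtype=np.float32)
    d = 28                                   # digit height/width
    for _ in range(n_digits):
        digit = mnist_digits[rng.randint(len(mnist_digits))] / 255.0
        x = rng.uniform(0, size - d)         # initial top-left corner
        y = rng.uniform(0, size - d)
        angle = rng.uniform(0, 2 * np.pi)    # direction uniform on the unit circle
        speed = rng.uniform(*speed_range)
        vx, vy = speed * np.cos(angle), speed * np.sin(angle)
        for t in range(n_frames):
            xi, yi = int(round(x)), int(round(y))
            # Overlapping digits simply add up here (clipped to [0, 1]);
            # the exact compositing rule in the original dataset is assumed.
            video[t, yi:yi + d, xi:xi + d] = np.clip(
                video[t, yi:yi + d, xi:xi + d] + digit, 0.0, 1.0)
            x, y = x + vx, y + vy
            if x < 0 or x > size - d:        # bounce off left/right walls
                vx = -vx
                x = np.clip(x, 0, size - d)
            if y < 0 or y > size - d:        # bounce off top/bottom walls
                vy = -vy
                y = np.clip(y, 0, size - d)
    return video
```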
Figure 5. Reconstruction and future prediction obtained from the Composite Model on a dataset of moving MNIST digits. Rows show the input sequence and ground truth future, followed by the input reconstruction and future prediction from a one layer Composite Model, a two layer Composite Model, and a two layer Composite Model with a conditional future predictor.

We first trained a single layer Composite Model. Each LSTM had 2048 units. The encoder took 10 frames as input. The decoder tried to reconstruct these 10 frames and the future predictor attempted to predict the next 10 frames. We used logistic output units with a cross entropy loss function. Fig. 5 shows two examples of running this model. The true sequences are shown in the first two rows. The next two rows show the reconstruction and future prediction from the one layer Composite Model. It is interesting to note that the model figures out how to separate superimposed digits and can model them even as they pass through each other. This shows some evidence of disentangling the two independent factors of variation in this sequence. The model can also correctly predict the motion after bouncing off the walls. In order to see if adding depth helps, we trained a two layer Composite Model, with each layer having 2048 units. We can see that adding depth helps the model make better predictions. Next, we changed the future predictor by making it conditional. We can see that this model makes sharper predictions.

Experiments on Natural Image Patches
Next, we tried to see if our models can also work with natural image patches. For this, we trained the models on sequences of 32 × 32 natural image patches extracted from the UCF-101 dataset. In this case, we used linear output units and the squared error loss function. The input was 16 frames and the model was asked to reconstruct the 16 frames and predict the future 13 frames. Fig. 6 shows the results obtained from a two layer Composite Model with 2048 units. We found that the reconstructions and the predictions are both very blurry. We then trained a bigger model with 4096 units. The outputs from this model are also shown in Fig. 6. We can see that the reconstructions get much sharper.

Figure 6. Reconstruction and future prediction obtained from the Composite Model on a dataset of natural image patches. The first two rows show ground truth sequences. The model takes 16 frames as inputs. Only the last 10 frames of the input sequence are shown here. The next 13 frames are the ground truth future. In the rows that follow, we show the reconstructed and predicted frames for two instances of the model (2048 and 4096 LSTM units).

Generalization over time scales
In the next experiment, we test if the model can work at time scales that are different than what it was trained on. We take a one hidden layer unconditioned Composite Model trained on moving MNIST digits. The model has 2048 LSTM units and looks at a 64 × 64 input. It was trained on input sequences of 10 frames to reconstruct those 10 frames as well as predict 10 frames into the future. In order to test if the future predictor is able to generalize beyond 10 frames, we let the model run for 100 steps into the future. Fig. 7(a) shows the pattern of activity in the LSTM units of the future predictor pathway for a randomly chosen test input. It shows the activity at each of the three sigmoidal gates (input, forget, output), the input (after the tanh non-linearity, before being multiplied by the input gate), the cell state and the final output (after being multiplied by the output gate). Even though the units are ordered randomly along the vertical axis, we can see that the dynamics has a periodic quality to it. The model is able to generate persistent motion for long periods of time. In terms of reconstruction, the model only outputs blobs after the first 15 frames, but the motion is relatively well preserved. More results, including long range future predictions over hundreds of time steps, can be seen at http://www.cs.toronto.edu/~nitish/unsupervised_video.
To show that setting up a periodic behaviour is not trivial, Fig. 7(b) shows the activity from a randomly initialized future predictor. Here, the LSTM state quickly converges and the outputs blur completely.

Figure 7. Pattern of activity in 200 randomly chosen LSTM units in the Future Predictor of a 1 layer (unconditioned) Composite Model trained on moving MNIST digits. The vertical axis corresponds to different LSTM units. The horizontal axis is time. Panels show, from left to right, the input gates, forget gates, input, output gates, cell states and output. The model was only trained to predict the next 10 frames, but here we let it run to predict the next 100 frames. Top (a): the dynamics has a periodic quality which does not die out. Bottom (b): the pattern of activity if the trained weights in the future predictor are replaced by random weights; the dynamics quickly dies out.

Out-of-domain Inputs
Next, we test this model's ability to deal with out-of-domain inputs. For this, we test the model on sequences of one and three moving digits. The model was trained on sequences of two moving digits, so it has never seen inputs with just one digit or three digits. Fig. 8 shows the reconstruction and future prediction results. For one moving digit, we can see that the model can do a good job but it really tries to hallucinate a second digit overlapping with the first one. The second digit shows up towards the end of the future reconstruction. For three digits, the model merges digits into blobs. However, it does well at getting the overall motion right. This highlights a key drawback of modeling entire frames of input in a single pass. In order to model videos with a variable number of objects, we perhaps need models that not only have an attention mechanism in place, but can also learn to execute themselves a variable number of times and do variable amounts of computation.

Visualizing Features
Next, we visualize the features learned by this model. Fig. 9 shows the weights that connect each input frame to the encoder LSTM. There are four sets of weights. One set of weights connects the frame to the input units. There are three other sets, one corresponding to each of the three gates (input, forget and output). Each weight has a size of 64 × 64. A lot of features look like thin strips. Others look like higher frequency strips. It is conceivable that the high frequency features help in encoding the direction and velocity of motion.

Fig. 10 shows the output features from the two LSTM decoders of a Composite Model. These correspond to the weights connecting the LSTM output units to the output layer. They appear to be somewhat qualitatively different from the input features shown in Fig. 9. There are many more output features that are local blobs, whereas those are rare in the input features. In the output features, the ones that do look like strips are much shorter than those in the input features. One way to interpret this is the following. The model needs to know about motion (which direction and how fast things are moving) from the input. This requires precise information about location (thin strips) and velocity (high frequency strips). But when it is generating the output, the model wants to hedge its bets so that it does not suffer a huge loss for predicting things sharply at the wrong place. This could explain why the output features have somewhat bigger blobs. The relative shortness of the strips in the output features can be explained by the fact that in the inputs, it does not hurt to have a longer feature than what is needed to detect a location because information is coarse-coded through multiple features. But in the output, the model may not want to put down a feature that is bigger than any digit because other units will have to conspire to correct for it.
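The feature panels in Fig. 9 and Fig. 10 are just weight rows reshaped into images; a minimal sketch of that visualization (selecting the top-200 features by L2 norm, as in the figures) is given below. The parameter layout follows the hypothetical init_lstm_layer dictionary sketched earlier, not the authors' actual checkpoint format.

```python
import numpy as np

def top_features(weight_matrix, k=200, patch_shape=(64, 64)):
    """Return the k rows of a (hidden_dim x 4096) weight matrix with the
    largest L2 norm, reshaped into 64 x 64 images for display, plus the
    row indices so that corresponding gate features can be looked up."""
    norms = np.linalg.norm(weight_matrix, axis=1)
    order = np.argsort(-norms)[:k]
    return weight_matrix[order].reshape(k, *patch_shape), order

# Example (assuming a parameter dict like the one above):
# params = init_lstm_layer(input_dim=4096, hidden_dim=2048)
# cell_feats, order = top_features(params['W_xc'])
# input_gate_feats = params['W_xi'][order].reshape(-1, 64, 64)
```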
3.3. Action Recognition on UCF-101/HMDB-51

The aim of this set of experiments is to see if the features learned by unsupervised learning can help improve performance on supervised tasks.
We trained a two layer Composite Model with 2048 hidden units with no conditioning on either decoder. The model was trained on percepts extracted from 300 hours of YouTube data. The model was trained to autoencode 16 frames and predict the next 13 frames. We initialize an LSTM classifier with the weights learned by the encoder LSTM from this model. The classifier is shown in Fig. 11. The output from each LSTM in the second layer goes into a softmax classifier that makes a prediction about the action being performed at each time step. Since only one action is being performed in each video in the datasets we consider, the target is the same at each time step. At test time, the predictions made at each time step are averaged. To get a prediction for the entire video, we average the predictions from all 16 frame blocks in the video with a stride of 8 frames. Using a smaller stride did not improve results.

Figure 11. LSTM Classifier.

The baseline for comparing these models is an identical LSTM classifier but with randomly initialized weights. All classifiers used dropout regularization, where we dropped activations as they were communicated across layers but not through time within the same LSTM, as proposed in Zaremba et al. (2014). We emphasize that this is a very strong baseline and does significantly better than just using single frames. Using dropout was crucial in order to train good baseline models, especially with very few training examples.
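To illustrate the test-time procedure described above, here is a hedged sketch of sliding 16-frame blocks over a video with a stride of 8 and averaging the per-step softmax predictions; classifier_probs stands in for the trained LSTM classifier and is an assumed interface, not an actual function from the released code.

```python
import numpy as np

def video_prediction(percepts, classifier_probs, block_len=16, stride=8):
    """Average class probabilities over all 16-frame blocks of a video.

    percepts: (T, feature_dim) array of frame percepts.
    classifier_probs: function mapping a (block_len, feature_dim) block to
        a (block_len, num_classes) array of per-time-step softmax outputs.
    """
    block_scores = []
    for start in range(0, len(percepts) - block_len + 1, stride):
        block = percepts[start:start + block_len]
        probs = classifier_probs(block)          # (block_len, num_classes)
        block_scores.append(probs.mean(axis=0))  # average over time steps
    return np.mean(block_scores, axis=0)         # average over blocks

# predicted_class = np.argmax(video_prediction(video_percepts, classifier_probs))
```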
Figure 8. Out-of-domain runs. Reconstruction and future prediction for test sequences of one and three moving digits. The model was trained on sequences of two moving digits.

Figure 9. Input features from a Composite Model trained on moving MNIST digits. In an LSTM, each input frame is connected to four sets of units - the input, the input gate, forget gate and output gate. These figures show the top-200 features ordered by L2 norm of the input features. The features in corresponding locations belong to the same LSTM unit. Panels: (a) inputs, (b) input gates, (c) forget gates, (d) output gates.

Figure 10. Output features from the two decoder LSTMs of a Composite Model trained on moving MNIST digits. These figures show the top-200 features ordered by L2 norm. Panels: (a) input reconstruction, (b) future prediction.
Model                                UCF-101 (RGB)   UCF-101 (1-frame flow)   HMDB-51 (RGB)
Single Frame                         72.2            72.2                     40.1
LSTM classifier                      74.5            74.3                     42.8
Composite LSTM Model + Finetuning    75.8            74.9                     44.1

Table 1. Summary of Results on Action Recognition.

Model                                               Cross Entropy on MNIST   Squared loss on image patches
Future Predictor                                    350.2                    225.2
Composite Model                                     344.9                    210.7
Conditional Future Predictor                        343.5                    221.3
Composite Model with Conditional Future Predictor   341.2                    208.1

Table 2. Future prediction results on MNIST and image patches. All models use 2 layers of LSTMs.

Fig. 12 compares three models - single frame classifier (logistic regression), baseline LSTM classifier and the LSTM classifier initialized with weights from the Composite Model - as the number of labelled videos per class is varied. Note that having one labelled video means having many labelled 16 frame blocks. We can see that for the case of very few training examples, unsupervised learning gives a substantial improvement. For example, for UCF-101, the performance improves from 29.6% to 34.3% when training on only one labelled video. As the size of the labelled dataset grows, the improvement becomes smaller. Even for the full UCF-101 dataset we still get a considerable improvement from 74.5% to 75.8%. On HMDB-51, the improvement is from 42.8% to 44.0% for the full dataset (70 videos per class) and 14.4% to 19.1% for one video per class. Although the improvement in classification by using unsupervised learning was not as big as we expected, we still managed to yield an additional improvement over a strong baseline. We discuss some avenues for improvements later.

We further ran similar experiments on the optical flow percepts extracted from the UCF-101 dataset. A temporal stream convolutional net, similar to the one proposed by Simonyan & Zisserman (2014a), was trained on single frame optical flows as well as on stacks of 10 optical flows. This gave an accuracy of 72.2% and 77.5% respectively. Here again, our models took 16 frames as input, reconstructed them and predicted 13 frames into the future. LSTMs with 128 hidden units improved the accuracy by 2.1% to 74.3% for the single frame case. Bigger LSTMs did not improve results. By pretraining the LSTM, we were able to further improve the classification to 74.9% (±0.1). For stacks of 10 frames we improved very slightly to 77.7%. These results are summarized in Table 1.

3.4. Comparison of Different Model Variants

The aim of this set of experiments is to compare the different variants of the model proposed in this paper. Since it is always possible to get lower reconstruction error by copying the inputs, we cannot use input reconstruction error as a measure of how well a model is doing. However, we can use the error in predicting the future as a reasonable measure of how well the model is doing. Besides, we can use the performance on supervised tasks as a proxy for how good the unsupervised model is doing. In this section, we present results from these two analyses.

Future prediction results are summarized in Table 2. For MNIST we compute the cross entropy of the predictions with respect to the ground truth, both of which are 64 × 64 patches. For natural image patches, we compute the squared loss. We see that the Composite Model always does a better job of predicting the future compared to the Future Predictor. This indicates that having the autoencoder along with the future predictor to force the model to remember more about the inputs actually helps predict the future better. Next, we can compare each model with its conditional variant. Here, we find that the conditional models perform better, as was also noted in Fig. 5.
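For reference, the two evaluation quantities in Table 2 can be computed as below; this is a hedged sketch of per-sequence cross entropy (for the Bernoulli pixel targets used on moving MNIST) and squared loss (for image patches), and the exact normalization behind the reported numbers is an assumption.

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-8):
    """Bernoulli cross entropy between predicted and true 64 x 64 frames,
    summed over pixels and averaged over frames in the sequence."""
    pred = np.clip(pred, eps, 1.0 - eps)
    ce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return ce.sum(axis=(-2, -1)).mean()

def squared_loss(pred, target):
    """Squared error for natural image patches, summed over pixels and
    averaged over frames in the sequence."""
    return ((pred - target) ** 2).sum(axis=(-2, -1)).mean()
```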
Next, we compare the models using performance on a supervised task. Table 3 shows the performance on action recognition achieved by finetuning different unsupervised learning models. Besides running the experiments on the full UCF-101 and HMDB-51 datasets, we also ran the experiments on small subsets of these to better highlight the case where we have very few training examples. We find that all unsupervised models improve over the baseline LSTM which is itself well-regularized by using dropout. The Autoencoder model seems to perform consistently better than the Future Predictor. The Composite model which combines the two does better than either one alone. Conditioning on the generated inputs does not seem to give a clear advantage over not doing so. The Composite Model with a conditional future predictor works the best, although its performance is almost the same as that of the Composite Model.

3.5. Comparison with Other Action Recognition Benchmarks

Finally, we compare our models to the state-of-the-art action recognition results. The performance is summarized in Table 4. The table is divided into three sets. The first set compares models that use only RGB data (single or multiple frames). The second set compares models that use explicitly computed flow features only. Models in the third set use both.
Figure 12. Effect of pretraining on action recognition with change in the size of the labelled training set, for (a) UCF-101 RGB and (b) HMDB-51 RGB. Each plot shows classification accuracy against the number of training examples per class for the Single Frame, LSTM, and LSTM + Pretraining models. The error bars are over 10 different samples of training sets.

Method                                               UCF-101 small   UCF-101   HMDB-51 small   HMDB-51
Baseline LSTM                                        63.7            74.5      25.3            42.8
Autoencoder                                          66.2            75.1      28.6            44.0
Future Predictor                                     64.9            74.9      27.3            43.1
Conditional Autoencoder                              65.8            74.8      27.9            43.1
Conditional Future Predictor                         65.1            74.9      27.4            43.4
Composite Model                                      67.0            75.8      29.1            44.1
Composite Model with Conditional Future Predictor    67.1            75.8      29.2            44.0

Table 3. Comparison of different unsupervised pretraining methods. UCF-101 small is a subset containing 10 videos per class. HMDB-51 small contains 4 videos per class.

Method                                                        UCF-101   HMDB-51
Spatial Convolutional Net (Simonyan & Zisserman, 2014a)       73.0      40.5
C3D (Tran et al., 2014)                                       72.3      -
C3D + fc6 (Tran et al., 2014)                                 76.4      -
LRCN (Donahue et al., 2014)                                   71.1      -
Composite LSTM Model                                          75.8      44.0

Temporal Convolutional Net (Simonyan & Zisserman, 2014a)      83.7      54.6
LRCN (Donahue et al., 2014)                                   77.0      -
Composite LSTM Model                                          77.7      -

LRCN (Donahue et al., 2014)                                   82.9      -
Two-stream Convolutional Net (Simonyan & Zisserman, 2014a)    88.0      59.4
Multi-skip feature stacking (Lan et al., 2014)                89.1      65.1
Composite LSTM Model                                          84.3      -

Table 4. Comparison with state-of-the-art action recognition models. The first block uses only RGB data, the second block uses only flow features, and the third block uses both.

On RGB data, our model performs at par with the best deep models. It performs 3% better than the LRCN model that also used LSTMs on top of convnet features¹. Our model performs better than C3D features that use a 3D convolutional net. However, when the C3D features are concatenated with fc6 percepts, they do slightly better than our model.

The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is at least partly due to the fact that the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover.

When we combine predictions from the RGB and flow models, we obtain 84.3% accuracy on UCF-101. We believe further improvements can be made by running the model over different patch locations and mirroring the patches. Also, our model can be applied deeper inside the convnet instead of just at the top-level. That can potentially lead to further improvements. In this paper, we focus on showing that unsupervised training helps consistently across both datasets and across different sized training sets.
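The late fusion mentioned above (combining the RGB and flow models) can be done by averaging their class probabilities; the sketch below assumes equal weighting, which is an assumption since the paper does not spell out the combination weights.

```python
import numpy as np

def fuse_predictions(rgb_probs, flow_probs, rgb_weight=0.5):
    """Combine per-video class probabilities from the RGB and flow models.
    Equal weighting is assumed; rgb_weight could be tuned on a validation split."""
    return rgb_weight * rgb_probs + (1.0 - rgb_weight) * flow_probs

# predicted_class = np.argmax(fuse_predictions(rgb_probs, flow_probs))
```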
¹However, the improvement is only partially from unsupervised learning, since we used a better convnet model.
4. Conclusions

We proposed models based on LSTMs that can learn good video representations. We compared them and analyzed their properties through visualizations. Moreover, we managed to get an improvement on supervised tasks. The best performing model was the Composite Model that combined an autoencoder and a future predictor. Conditioning on generated outputs did not have a significant impact on the performance for supervised tasks; however, it made the future predictions look slightly better. The model was able to persistently generate motion well beyond the time scales it was trained for. However, it lost the precise object features rapidly after the training time scale. The features at the input and output layers were found to have some interesting properties.

To further get improvements for supervised tasks, we believe that the model can be extended by applying it convolutionally across patches of the video and stacking multiple layers of such models. Applying this model in the lower layers of a convolutional net could help extract motion information that would otherwise be lost across max-pooling layers. In our future work, we plan to build models based on these autoencoders from the bottom up instead of applying them only to percepts.

Acknowledgments

We acknowledge the support of Samsung, Raytheon BBN Technologies, and NVIDIA Corporation for the donation of a GPU used for this research. The authors would like to thank Geoffrey Hinton and Ilya Sutskever for helpful discussions and comments.

References

Cho, Kyunghyun, van Merrienboer, Bart, Gülçehre, Çaglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 1724–1734, 2014.

Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.

Graves, Alex. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.

Graves, Alex and Jaitly, Navdeep. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772, 2014.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hurri, Jarmo and Hyvärinen, Aapo. Simple-cell-like receptive fields maximize temporal coherence in natural video. Neural Computation, 15(3):663–691, 2003.

Ji, Shuiwang, Xu, Wei, Yang, Ming, and Yu, Kai. 3D convolutional neural networks for human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221–231, Jan 2013.

Karpathy, Andrej, Toderici, George, Shetty, Sanketh, Leung, Thomas, Sukthankar, Rahul, and Fei-Fei, Li. Large-scale video classification with convolutional neural networks. In CVPR, 2014.

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.

Lan, Zhen-Zhong, Lin, Ming, Li, Xuanchong, Hauptmann, Alexander G., and Raj, Bhiksha. Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, abs/1411.6660, 2014.

Le, Q. V., Zou, W., Yeung, S. Y., and Ng, A. Y. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.

Memisevic, Roland. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829–1846, 2013.

Memisevic, Roland and Hinton, Geoffrey E. Learning to represent spatial transformations with factored higher-order boltzmann machines. Neural Computation, 22(6):1473–1492, June 2010.

Michalski, Vincent, Memisevic, Roland, and Konda, Kishore. Modeling deep temporal dependencies with recurrent grammar cells. In Advances in Neural Information Processing Systems 27, pp. 1925–1933. Curran Associates, Inc., 2014.

Ranzato, Marc'Aurelio, Szlam, Arthur, Bruna, Joan, Mathieu, Michaël, Collobert, Ronan, and Chopra, Sumit. Video (language) modeling: a baseline for generative models of natural videos. CoRR, abs/1412.6604, 2014.

Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014a.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014b.

Soomro, K., Roshan Zamir, A., and Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01, 2012.

Susskind, J., Memisevic, R., Hinton, G., and Pollefeys, M. Modeling the joint density of two images under a variety of transformations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pp. 3104–3112, 2014.

Tran, Du, Bourdev, Lubomir D., Fergus, Rob, Torresani, Lorenzo, and Paluri, Manohar. C3D: generic features for video analysis. CoRR, abs/1412.0767, 2014.

van Hateren, J. H. and Ruderman, D. L. Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings. Biological Sciences / The Royal Society, 265(1412):2315–2320, 1998.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.

Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. CoRR, abs/1409.2329, 2014.