Unsupervised Learning of Video Representations Using LSTMs
dimensionality low. The costly work of collecting more labelled data and the tedious work of doing more clever engineering can go a long way in solving particular problems, but this is ultimately unsatisfying as a machine learning solution. This highlights the need for using unsupervised learning to find and represent structure in videos. Moreover, videos have a lot of structure in them (spatial and temporal regularities) which makes them particularly well suited as a domain for building unsupervised learning models.

1.2. Our Approach

When designing any unsupervised learning model, it is crucial to have the right inductive biases and choose the right objective function so that the learning signal points the model towards learning useful features. In this paper, we use the LSTM Encoder-Decoder framework to learn video representations. The key inductive bias here is that the same operation must be applied at each time step to propagate information to the next step. This enforces the fact that the physics of the world remains the same, irrespective of input. The same physics acting on any state, at any time, must produce the next state. Our model works as follows. The Encoder LSTM runs through a sequence of frames to come up with a representation. This representation is then decoded through another LSTM to produce a target sequence. We consider different choices of the target sequence. One choice is to predict the same sequence as the input. The motivation is similar to that of autoencoders – we wish to capture all that is needed to reproduce the input but at the same time go through the inductive biases imposed by the model. Another option is to predict the future frames. Here the motivation is to learn a representation that extracts all that is needed to extrapolate the motion and appearance beyond what has been observed. These two natural choices can also be combined. In this case, there are two decoder LSTMs – one that decodes the representation into the input sequence and another that decodes the same representation to predict the future.

The inputs to the model can, in principle, be any representation of individual video frames. However, for the purposes of this work, we limit our attention to two kinds of inputs. The first is image patches. For this we use natural image patches as well as a dataset of moving MNIST digits. The second is high-level “percepts” extracted by applying a convolutional net trained on ImageNet. These percepts are the states of the last (and/or second-to-last) layers of rectified linear hidden units from a convolutional neural net model.

In order to evaluate the learned representations we qualitatively analyze the reconstructions and predictions made by the model. For a more quantitative evaluation, we use these LSTMs as initializations for the supervised task of action recognition. If the unsupervised learning model comes up with useful representations then the classifier should be able to perform better, especially when there are only a few labelled examples. We find that this is indeed the case.

1.3. Related Work

The first approaches to learning representations of videos in an unsupervised way were based on ICA (van Hateren & Ruderman, 1998; Hurri & Hyvärinen, 2003). Le et al. (2011) approached this problem using multiple layers of Independent Subspace Analysis modules. Generative models for understanding transformations between pairs of consecutive images are also well studied (Memisevic, 2013; Memisevic & Hinton, 2010; Susskind et al., 2011). This work was extended recently by Michalski et al. (2014) to model longer sequences.

Recently, Ranzato et al. (2014) proposed a generative model for videos. The model uses a recurrent neural network to predict the next frame or interpolate between frames. In this work, the authors highlight the importance of choosing the right loss function. It is argued that squared loss in input space is not the right objective because it does not respond well to small distortions in input space. The proposed solution is to quantize image patches into a large dictionary and train the model to predict the identity of the target patch. This does solve some of the problems of squared loss but it introduces an arbitrary dictionary size into the picture and altogether removes the idea of patches being similar or dissimilar to one another. Designing an appropriate loss function that respects our notion of visual similarity is a very hard problem (in a sense, almost as hard as the modeling problem we want to solve in the first place). Therefore, in this paper, we use the simple squared loss objective function as a starting point and focus on designing an encoder-decoder RNN architecture that can be used with any loss function.

2. Model Description

In this section, we describe several variants of our LSTM Encoder-Decoder model. The basic unit of our network is the LSTM cell block. Our implementation of LSTMs follows closely the one discussed by Graves (2013).

2.1. Long Short Term Memory

In this section we briefly describe the LSTM unit, which is the basic building block of our model. The unit is shown in Fig. 1 (reproduced from Graves (2013)).

Each LSTM unit has a cell which has a state c_t at time t. This cell can be thought of as a memory unit. Access to this memory unit for reading or modifying it is controlled through sigmoidal gates – the input gate i_t, the forget gate f_t and the output gate o_t.
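For completeness, the corresponding update equations, in the standard Graves (2013) formulation with peephole connections (sigma is the logistic sigmoid, \odot denotes elementwise multiplication, and the peephole matrices W_{ci}, W_{cf}, W_{co} are diagonal), can be written as:

    i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
    f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
    c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
    o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
    h_t = o_t \odot \tanh(c_t)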
[Figures 2 and 3 (architecture diagrams): an encoder LSTM (weights W1) reads the sequence of input frames v1, v2, v3 into a learned representation, which is copied into a decoder LSTM (W2) that either reconstructs the input sequence (v3, v2, ...) or predicts the future frames (v4, v5, ...).]

2.2. LSTM Autoencoder Model

First, the fixed number of hidden units makes it unlikely that the model can learn trivial mappings for arbitrary length input sequences. Second, the same LSTM operation is used to decode the representation recursively. This means that the same dynamics must be applied on the representation at any stage of decoding. This further prevents the model from learning an identity mapping.

2.3. LSTM Future Predictor Model

Another natural unsupervised learning task for sequences is predicting the future. This is the approach used in language models for modeling sequences of words. The design of the Future Predictor Model is the same as that of the Autoencoder Model, except that the decoder LSTM in this case predicts frames of the video that come after the input sequence (Fig. 3). Ranzato et al. (2014) use a similar model but predict only the next frame at each time step. This model, on the other hand, predicts a long sequence into the future. Here again we can consider two variants of the decoder – conditional and unconditioned.

Why should this learn good features?
In order to predict the next few frames correctly, the model needs information about which objects and background are present and how they are moving so that the motion can be extrapolated. The hidden state coming out from the encoder will try to capture this information. Therefore, this state can be seen as a representation of the input sequence.

2.4. Conditional Decoder

For each of these two models, we can consider two possibilities: one in which the decoder LSTM is conditioned on the last generated frame and the other in which it is not. In the experimental section, we explore these choices quantitatively. Here we briefly discuss arguments for and against a conditional decoder. A strong argument in favour of using a conditional decoder is that it allows the decoder to model multiple modes in the target sequence distribution. Without that, we would end up averaging the multiple modes in the low-level input space. However, this is an issue only if we expect multiple modes in the target sequence distribution. For the LSTM Autoencoder, there is only one correct target and hence a unimodal target distribution. But for the LSTM Future Predictor there is a possibility of multiple targets given an input because, even if we assume a deterministic universe, everything needed to predict the future will not necessarily be observed in the input.

There is also an argument against using a conditional decoder from the optimization point of view. There are strong short-range correlations in video data; for example, most of the content of a frame is the same as that of the previous one. If the decoder were given access to the last few frames while generating a particular frame at training time, it would find it easy to pick up on these correlations. There would only be a very small gradient that tries to fix up the extremely subtle errors that require long-term knowledge about the input sequence. In an unconditioned decoder, this input is removed and the model is forced to look for information deep inside the encoder.

Figure 4. The Composite Model: The LSTM predicts the future as well as the input sequence.

2.5. A Composite Model

The two tasks – reconstructing the input and predicting the future – can be combined to create a composite model as shown in Fig. 4. Here the encoder LSTM is asked to come up with a state from which we can both predict the next few frames as well as reconstruct the input.

This composite model tries to overcome the shortcomings that each model suffers on its own. A high-capacity autoencoder would suffer from the tendency to learn trivial representations that just memorize the inputs. However, this memorization is not useful at all for predicting the future. Therefore, the composite model cannot just memorize information.
On the other hand, the future predictor suffers from the tendency to store information only about the last few frames, since those are most important for predicting the future, i.e., in order to predict v_t, the frames {v_{t-1}, ..., v_{t-k}} are much more important than v_0, for some small value of k. Therefore the representation at the end of the encoder will have forgotten about a large part of the input. But if we ask the model to also predict all of the input sequence, then it cannot just pay attention to the last few frames.
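To make the structure concrete, the following is a minimal PyTorch sketch of the Composite Model with unconditioned decoders. It is an illustration rather than the authors' implementation (which is linked in Section 3.1 and uses LSTMs with peephole connections; torch.nn.LSTM omits them), and the layer sizes are placeholders:

    import torch
    import torch.nn as nn

    class CompositeLSTM(nn.Module):
        """Encoder LSTM whose final state is copied into two decoder LSTMs:
        one reconstructs the input sequence, the other predicts future frames."""
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
            self.dec_recon = nn.LSTM(input_dim, hidden_dim, batch_first=True)
            self.dec_future = nn.LSTM(input_dim, hidden_dim, batch_first=True)
            self.out_recon = nn.Linear(hidden_dim, input_dim)
            self.out_future = nn.Linear(hidden_dim, input_dim)

        def forward(self, frames, n_future):
            # frames: (batch, T, input_dim), e.g. flattened 64x64 patches or fc6 percepts
            batch, T, dim = frames.shape
            _, state = self.encoder(frames)              # state = (h, c): the learned representation
            # Unconditioned decoding: both decoders receive zero inputs and are
            # driven only by the copied encoder state.
            h_recon, _ = self.dec_recon(frames.new_zeros(batch, T, dim), state)
            h_future, _ = self.dec_future(frames.new_zeros(batch, n_future, dim), state)
            recon = self.out_recon(h_recon)              # reconstruction of the input sequence
            future = self.out_future(h_future)           # prediction of the next n_future frames
            return recon, future

In the conditional variants, each decoder would instead receive its previously generated frame (or the previous ground-truth frame at training time) as input at every step.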
3. Experiments

We design experiments to accomplish the following objectives:

• Get a qualitative understanding of what the LSTM learns to do.

• Measure the benefit of initializing networks for supervised learning tasks with the weights found by unsupervised learning, especially with very few training examples.

• Compare the different proposed models – Autoencoder, Future Predictor and Composite models and their conditional variants.

• Compare with state-of-the-art action recognition benchmarks.

3.1. Datasets

We use the UCF-101 and HMDB-51 datasets for supervised tasks. The UCF-101 dataset (Soomro et al., 2012) contains 13,320 videos with an average length of 6.2 seconds belonging to 101 different action categories. The dataset has 3 standard train/test splits with the training set containing around 9,500 videos in each split (the rest are test). The HMDB-51 dataset (Kuehne et al., 2011) contains 5100 videos belonging to 51 different action categories. Mean length of the videos is 3.2 seconds. This also has 3 train/test splits with 3570 videos in the training set and the rest in test.

To train the unsupervised models, we used a subset of the Sports-1M dataset (Karpathy et al., 2014), which contains 1 million YouTube clips. Even though this dataset is labelled for actions, we did not do any supervised experiments on it because of logistical constraints with working with such a huge dataset. We instead collected 300 hours of video by randomly sampling 10 second clips from the dataset. It would be possible to collect better samples if, instead of choosing randomly, we extracted videos where a lot of motion is happening and where there are no shot boundaries. However, we did not do so in the spirit of unsupervised learning, and because we did not want to introduce any unnatural bias in the samples. We also used the supervised datasets (UCF-101 and HMDB-51) for unsupervised training. However, we found that using them did not give any significant advantage over just using the YouTube videos.

We extracted percepts using the convolutional neural net model of Simonyan & Zisserman (2014b). The videos have a resolution of 240 × 320 and were sampled at almost 30 frames per second. We took the central 224 × 224 patch from each frame and ran it through the convnet. This gave us the RGB percepts. Additionally, for UCF-101, we computed flow percepts by extracting flows using the Brox method and training the temporal stream convolutional network as described by Simonyan & Zisserman (2014a). We found that the fc6 features worked better than fc7 for single frame classification using both RGB and flow percepts. Therefore, we used the 4096-dimensional fc6 layer as the input representation of our data. Besides these percepts, we also trained the proposed models on 32 × 32 patches of pixels.

All models were trained using backprop on a single NVIDIA Titan GPU. A two layer 2048 unit Composite model that predicts 13 frames and reconstructs 16 frames took 18-20 hours to converge on 300 hours of percepts. We initialized weights by sampling from a uniform distribution whose scale was set to 1/sqrt(fan-in). Biases at all the gates were initialized to zero. Peephole connections were initialized to zero. The supervised classifiers trained on 16 frames took 5-15 minutes to converge. The code can be found at https://github.com/emansim/unsupervised-videos.
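As an illustration of the percept pipeline described above (a sketch under assumptions: torchvision's ImageNet-pretrained VGG-16 stands in for the Simonyan & Zisserman model, and the usual ImageNet normalization constants are used):

    import torch
    import torchvision
    from torchvision import transforms

    vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()  # stand-in convnet

    preprocess = transforms.Compose([
        transforms.CenterCrop(224),                       # central 224x224 patch of each frame
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],  # standard ImageNet statistics
                             std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def fc6_percepts(frames):
        """frames: list of PIL video frames -> (T, 4096) rectified fc6 percepts."""
        x = torch.stack([preprocess(f) for f in frames])          # (T, 3, 224, 224)
        feats = torch.flatten(vgg.avgpool(vgg.features(x)), 1)    # convolutional features
        return torch.relu(vgg.classifier[0](feats))               # fc6 followed by ReLU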
3.2. Visualization and Qualitative Analysis

The aim of this set of experiments is to visualize the properties of the proposed models.

Experiments on MNIST

We first trained our models on a dataset of moving MNIST digits. In this dataset, each video was 20 frames long and consisted of two digits moving inside a 64 × 64 patch. The digits were chosen randomly from the training set and placed initially at random locations inside the patch. Each digit was assigned a velocity whose direction was chosen uniformly at random on a unit circle and whose magnitude was also chosen uniformly at random over a fixed range. The digits bounced off the edges of the 64 × 64 frame and overlapped if they were at the same location. The reason for working with this dataset is that it is infinite in size and can be generated quickly on the fly. This makes it possible to explore the model without expensive disk accesses or overfitting issues. It also has interesting behaviours due to occlusions and the dynamics of bouncing off the walls.
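These bouncing-digit videos are easy to generate on the fly. The following is an illustrative generator (the speed range and rendering details are assumptions, not the paper's exact values):

    import numpy as np

    def make_moving_mnist(digits, num_frames=20, image_size=64, digit_size=28,
                          num_digits=2, rng=np.random):
        """Render `num_digits` MNIST digits (array of shape (N, 28, 28), values in [0, 1])
        bouncing inside an image_size x image_size frame for num_frames steps."""
        video = np.zeros((num_frames, image_size, image_size), dtype=np.float32)
        limit = image_size - digit_size
        for _ in range(num_digits):
            img = digits[rng.randint(len(digits))]
            x, y = rng.uniform(0, limit, size=2)          # random initial location
            theta = rng.uniform(0, 2 * np.pi)             # direction uniform on the unit circle
            speed = rng.uniform(2.0, 5.0)                 # magnitude uniform over a fixed range
            vx, vy = speed * np.cos(theta), speed * np.sin(theta)
            for t in range(num_frames):
                xi, yi = int(round(x)), int(round(y))
                patch = video[t, yi:yi + digit_size, xi:xi + digit_size]
                np.maximum(patch, img, out=patch)         # digits overlap at the same location
                x, y = x + vx, y + vy
                if x < 0 or x > limit:                    # bounce off the left/right edges
                    vx, x = -vx, float(np.clip(x, 0, limit))
                if y < 0 or y > limit:                    # bounce off the top/bottom edges
                    vy, y = -vy, float(np.clip(y, 0, limit))
        return video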
Figure 5. Reconstruction and future prediction obtained from the Composite Model on a dataset of moving MNIST digits. (Panels: Input Sequence, Ground Truth Future.)
We first trained a single layer Composite Model. Each LSTM had 2048 units. The encoder took 10 frames as input. The decoder tried to reconstruct these 10 frames and the future predictor attempted to predict the next 10 frames. We used logistic output units with a cross entropy loss function. Fig. 5 shows two examples of running this model. The true sequences are shown in the first two rows. The next two rows show the reconstruction and future prediction from the one layer Composite Model. It is interesting to note that the model figures out how to separate superimposed digits and can model them even as they pass through each other. This shows some evidence of disentangling the two independent factors of variation in this sequence. The model can also correctly predict the motion after bouncing off the walls. In order to see if adding depth helps, we trained a two layer Composite Model, with each layer having 2048 units. We can see that adding depth helps the model make better predictions. Next, we changed the future predictor by making it conditional. We can see that this model makes sharper predictions.
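In code, the two per-pixel objectives used in these experiments reduce to a cross entropy on logistic outputs for binary digit frames and a squared error on linear outputs for natural patches (an illustrative sketch; the tensor names are placeholders):

    import torch.nn.functional as F

    def frame_loss(pred, target, binary_pixels=True):
        """pred: raw decoder outputs; target: ground-truth frames with values in [0, 1]."""
        if binary_pixels:                                # moving MNIST: logistic outputs + cross entropy
            return F.binary_cross_entropy_with_logits(pred, target)
        return F.mse_loss(pred, target)                  # natural patches: linear outputs + squared error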
put), the input (after the tanh non-linearity, before being
Experiments on Natural Image Patches multiplied by the input gate), the cell state and the final
Next, we tried to see if our models can also work with nat- output (after being multiplied by the output gate). Even
ural image patches. For this, we trained the models on se- though the units are ordered randomly along the vertical
quences of 32 × 32 natural image patches extracted from axis, we can see that the dynamics has a periodic quality
the UCF-101 dataset. In this case, we used linear output to it. The model is able to generate persistent motion for
units and the squared error loss function. The input was long periods of time. In terms of reconstruction, the model
16 frames and the model was asked to reconstruct the 16 only outputs blobs after the first 15 frames, but the motion
frames and predict the future 13 frames. Fig. 6 shows the is relatively well preserved. More results, including long
results obtained from a two layer Composite model with range future predictions over hundreds of time steps can see
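Because the future predictor is unconditioned, running it beyond the training horizon requires no architectural change; with the hypothetical CompositeLSTM sketch from Section 2.5, it is simply a longer decode:

    # Trained with 10 input frames and 10 future frames; roll the future pathway out for 100 steps.
    recon, long_rollout = model(frames, n_future=100)   # frames: (batch, 10, 64*64)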
Figure 6. Reconstruction and future prediction obtained from the Composite Model on a dataset of natural image patches. The first two rows show ground truth sequences. The model takes 16 frames as inputs. Only the last 10 frames of the input sequence are shown here. The next 13 frames are the ground truth future. In the rows that follow, we show the reconstructed and predicted frames for two instances of the model. (Panels: Input Sequence, Ground Truth Future.)
To show that setting up a periodic behaviour is not trivial, Fig. 7(b) shows the activity from a randomly initialized future predictor. Here, the LSTM state quickly converges and the outputs blur completely.

Out-of-domain Inputs

Next, we test this model's ability to deal with out-of-domain inputs. For this, we test the model on sequences of one and three moving digits. The model was trained on sequences of two moving digits, so it has never seen inputs with just one digit or three digits. Fig. 8 shows the reconstruction and future prediction results. For one moving digit, we can see that the model can do a good job but it really tries to hallucinate a second digit overlapping with the first one. The second digit shows up towards the end of the future reconstruction. For three digits, the model merges digits into blobs. However, it does well at getting the overall motion right. This highlights a key drawback of modeling entire frames of input in a single pass. In order to model videos with a variable number of objects, we perhaps need models that not only have an attention mechanism in place, but can also learn to execute themselves a variable number of times and do variable amounts of computation.

Visualizing Features

Next, we visualize the features learned by this model. Fig. 9 shows the weights that connect each input frame to the encoder LSTM. There are four sets of weights. One set of weights connects the frame to the input units. There are three other sets, one corresponding to each of the three gates (input, forget and output). Each weight has a size of 64 × 64. A lot of features look like thin strips. Others look like higher frequency strips. It is conceivable that the high frequency features help in encoding the direction and velocity of motion.

Fig. 10 shows the output features from the two LSTM decoders of a Composite Model. These correspond to the weights connecting the LSTM output units to the output layer. They appear to be somewhat qualitatively different from the input features shown in Fig. 9. There are many more output features that are local blobs, whereas those are rare in the input features. In the output features, the ones that do look like strips are much shorter than those in the input features. One way to interpret this is the following. The model needs to know about motion (which direction and how fast things are moving) from the input. This requires precise information about location (thin strips) and velocity (high frequency strips). But when it is generating the output, the model wants to hedge its bets so that it does not suffer a huge loss for predicting things sharply at the wrong place. This could explain why the output features have somewhat bigger blobs. The relative shortness of the strips in the output features can be explained by the fact that in the inputs, it does not hurt to have a longer feature than what is needed to detect a location, because information is coarse-coded through multiple features. But in the output, the model may not want to put down a feature that is bigger than any digit because other units will have to conspire to correct for it.
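Figures 9 and 10 can be reproduced from a trained model by ranking features by their L2 norm. The following sketch assumes the input-to-hidden weights are stored as a single matrix of shape (4 * num_units, 64 * 64) with blocks stacked as [input, input gate, forget gate, output gate]; actual layouts differ between implementations:

    import numpy as np

    def top_input_features(w_input, num_units, k=200, frame_shape=(64, 64)):
        """Return, for each of the four blocks, the k features whose input weights
        have the largest L2 norm, reshaped to 64x64 for display."""
        blocks = w_input.reshape(4, num_units, -1)                   # (4, num_units, 4096)
        order = np.argsort(-np.linalg.norm(blocks[0], axis=1))[:k]   # rank by the input block
        return blocks[:, order, :].reshape(4, k, *frame_shape)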
3.3. Action Recognition on UCF-101/HMDB-51

The aim of this set of experiments is to see if the features learned by unsupervised learning can help improve performance on action recognition.
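The supervised setup used in this comparison can be sketched as follows (an illustration, reusing the hypothetical CompositeLSTM from the sketch in Section 2.5): an LSTM classifier is run over the 16 percepts of a block, its per-step class predictions are averaged, and its LSTM is either randomly initialized (the baseline) or initialized from the pretrained encoder before fine-tuning.

    import torch
    import torch.nn as nn

    class LSTMActionClassifier(nn.Module):
        """LSTM over frame percepts; class predictions are averaged over time."""
        def __init__(self, input_dim, hidden_dim, num_classes, dropout=0.5):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
            self.drop = nn.Dropout(dropout)
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, frames):                        # frames: (batch, 16, input_dim)
            outputs, _ = self.lstm(frames)                # (batch, 16, hidden_dim)
            logits = self.fc(self.drop(outputs))          # per-step class scores
            return logits.mean(dim=1)                     # average predictions over the block

    def init_from_pretrained(classifier, composite):
        """Copy the unsupervised encoder's weights into the classifier's LSTM
        (dimensions must match the pretrained encoder)."""
        classifier.lstm.load_state_dict(composite.encoder.state_dict())
        return classifier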
[Figure 7: activity of the future predictor's LSTM units over a long roll-out; panels show Input Gates, Forget Gates, Input, Output Gates, Cell States and Output.]
Figure 8. Out-of-domain runs. Reconstruction and Future prediction for test sequences of one and three moving digits. The model was
trained on sequences of two moving digits.
Figure 9. Input features from a Composite Model trained on moving MNIST digits. In an LSTM, each input frame is connected to four
sets of units - the input, the input gate, forget gate and output gate. These figures show the top-200 features ordered by L2 norm of the
input features. The features in corresponding locations belong to the same LSTM unit.
Figure 10. Output features from the two decoder LSTMs of a Composite Model trained on moving MNIST digits. These figures show
the top-200 features ordered by L2 norm.
Fig. 12 compares three models: a single frame classifier (logistic regression), a baseline LSTM classifier, and the LSTM classifier initialized with weights from the Composite Model, as the number of labelled videos per class is varied. Note that having one labelled video means having many labelled 16 frame blocks. We can see that for the case of very few training examples, unsupervised learning gives a substantial improvement. For example, for UCF-101, the performance improves from 29.6% to 34.3% when training on only one labelled video. As the size of the labelled dataset grows, the improvement becomes smaller. Even for the full UCF-101 dataset we still get a considerable improvement from 74.5% to 75.8%. On HMDB-51, the improvement is from 42.8% to 44.0% for the full dataset (70 videos per class) and from 14.4% to 19.1% for one video per class. Although the improvement in classification from using unsupervised learning was not as big as we expected, we still managed to yield an additional improvement over a strong baseline. We discuss some avenues for improvements later.

We further ran similar experiments on the optical flow percepts extracted from the UCF-101 dataset. A temporal stream convolutional net, similar to the one proposed by Simonyan & Zisserman (2014a), was trained on single frame optical flows as well as on stacks of 10 optical flows. This gave an accuracy of 72.2% and 77.5% respectively. Here again, our models took 16 frames as input, reconstructed them and predicted 13 frames into the future. LSTMs with 128 hidden units improved the accuracy by 2.1% to 74.3% for the single frame case. Bigger LSTMs did not improve results. By pretraining the LSTM, we were able to further improve the classification to 74.9% (±0.1). For stacks of 10 frames we improved very slightly to 77.7%. These results are summarized in Table 1.

3.4. Comparison of Different Model Variants

The aim of this set of experiments is to compare the different variants of the model proposed in this paper. Since it is always possible to get lower reconstruction error by copying the inputs, we cannot use input reconstruction error as a measure of how good a model is doing. However, we can use the error in predicting the future as a reasonable measure of how good the model is doing. Besides, we can use the performance on supervised tasks as a proxy for how good the unsupervised model is doing. In this section, we present results from these two analyses.

Table 2. Future prediction results on MNIST and image patches. All models use 2 layers of LSTMs.

Future prediction results are summarized in Table 2. For MNIST we compute the cross entropy of the predictions with respect to the ground truth, both of which are 64 × 64 patches. For natural image patches, we compute the squared loss. We see that the Composite Model always does a better job of predicting the future compared to the Future Predictor. This indicates that having the autoencoder along with the future predictor to force the model to remember more about the inputs actually helps predict the future better. Next, we can compare each model with its conditional variant. Here, we find that the conditional models perform better, as was also noted in Fig. 5.

Next, we compare the models using performance on a supervised task. Table 3 shows the performance on action recognition achieved by finetuning different unsupervised learning models. Besides running the experiments on the full UCF-101 and HMDB-51 datasets, we also ran the experiments on small subsets of these to better highlight the case where we have very few training examples. We find that all unsupervised models improve over the baseline LSTM, which is itself well-regularized by using dropout. The Autoencoder model seems to perform consistently better than the Future Predictor. The Composite model, which combines the two, does better than either one alone. Conditioning on the generated inputs does not seem to give a clear advantage over not doing so. The Composite Model with a conditional future predictor works the best, although its performance is almost the same as that of the Composite Model.

3.5. Comparison with Other Action Recognition Benchmarks

Finally, we compare our models to the state-of-the-art action recognition results. The performance is summarized in Table 4. The table is divided into three sets. The first set compares models that use only RGB data (single or multiple frames). The second set compares models that use explicitly computed flow features only. Models in the third set use both.

On RGB data, our model performs at par with the best deep models.
[Figure 12: classification accuracy on UCF-101 and HMDB-51 as the number of labelled videos per class is varied.]
Table 3. Comparison of different unsupervised pretraining methods. UCF-101 small is a subset containing 10 videos per class. HMDB-51 small contains 4 videos per class.
It performs 3% better than the LRCN model that also used LSTMs on top of convnet features¹. Our model performs better than C3D features that use a 3D convolutional net. However, when the C3D features are concatenated with fc6 percepts, they do slightly better than our model.

¹ However, the improvement is only partially from unsupervised learning, since we used a better convnet model.

The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is at least partly due to the fact that the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover.

When we combine predictions from the RGB and flow models, we obtain 84.3% accuracy on UCF-101. We believe further improvements can be made by running the model over different patch locations and mirroring the patches. Also, our model can be applied deeper inside the convnet instead of just at the top level. That can potentially lead to further improvements. In this paper, we focus on showing that unsupervised training helps consistently across both datasets and across different sized training sets.

Table 4. Comparison with state-of-the-art action recognition models.

Method                                                        UCF-101   HMDB-51
RGB only:
  Spatial Convolutional Net (Simonyan & Zisserman, 2014a)       73.0      40.5
  C3D (Tran et al., 2014)                                       72.3      -
  C3D + fc6 (Tran et al., 2014)                                 76.4      -
  LRCN (Donahue et al., 2014)                                   71.1      -
  Composite LSTM Model                                          75.8      44.0
Flow only:
  Temporal Convolutional Net (Simonyan & Zisserman, 2014a)      83.7      54.6
  LRCN (Donahue et al., 2014)                                   77.0      -
  Composite LSTM Model                                          77.7      -
RGB + flow:
  LRCN (Donahue et al., 2014)                                   82.9      -
  Two-stream Convolutional Net (Simonyan & Zisserman, 2014a)    88.0      59.4
  Multi-skip feature stacking (Lan et al., 2014)                89.1      65.1
  Composite LSTM Model                                          84.3      -

4. Conclusions

We proposed models based on LSTMs that can learn good video representations. We compared them and analyzed their properties through visualizations. Moreover, we managed to get an improvement on supervised tasks. The best performing model was the Composite Model that combined an autoencoder and a future predictor. Conditioning on generated outputs did not have a significant impact on the
performance for supervised tasks; however, it made the future predictions look slightly better. The model was able to persistently generate motion well beyond the time scales it was trained for. However, it lost the precise object features rapidly after the training time scale. The features at the input and output layers were found to have some interesting properties.

To further get improvements for supervised tasks, we believe that the model can be extended by applying it convolutionally across patches of the video and stacking multiple layers of such models. Applying this model in the lower layers of a convolutional net could help extract motion information that would otherwise be lost across max-pooling layers. In our future work, we plan to build models based on these autoencoders from the bottom up instead of applying them only to percepts.

References

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.

Lan, Zhen-Zhong, Lin, Ming, Li, Xuanchong, Hauptmann, Alexander G., and Raj, Bhiksha. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, abs/1411.6660, 2014.

Le, Q. V., Zou, W., Yeung, S. Y., and Ng, A. Y. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.

Memisevic, Roland. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829–1846, 2013.

Memisevic, Roland and Hinton, Geoffrey E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, June 2010.