Seminar Report On
Multimodal Deep Learning
Submitted in Partial Fulfilment of the
Requirements for the award of
Bachelor of Technology in
Computer Science and Engineering
(2013-2017)
Submitted By:
Sangeetha Mathew
Roll No. 13028102
Guide:
Ms. Divya Madhu
Department of CSE
Muthoot Institute of Technology and Science (MITS)
Varikoli P.O, Puthencruz- 682308
MUTHOOT INSTITUTE OF TECHNOLOGY &
SCIENCE
Varikoli P.O, Puthencruz- 682308
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CERTIFICATE
This is to certify that the Seminar report entitled
"Multimodal Deep Learning" submitted by Sangeetha Mathew
(13028102) of Semester VII is a bonafide account of the work done by her
under our supervision.
Guide
Ms. Divya Madhu
Dept. of CSE
Head of the Department
Dr. Sanju V.
Dept. of CSE
Acknowledgements
I respect and thank Dr. RamKumar S., Principal of MITS, for giving
me the opportunity to do this seminar.
I would like to sincerely thank my guide Ms. Divya Madhu, Asst. Prof.,
CSE, for her support and valuable guidance. Her timely advice, meticulous scrutiny, and scholarly and scientific approach have helped me complete this
seminar on time.
I am thankful to our seminar coordinator, Asst. Prof. Jency Thomas,
for their insight, supervision and encouragement, which have helped me to
complete this seminar on time. I would also like to thank our college, MITS,
for providing the best facility and support.
I thank each and every staff member of the Computer Science Department for
lending us help and support. I express my heartfelt veneration to all who have
been helpful and inspiring throughout this endeavour. Last but not least,
I thank the Almighty for blessing me with the successful completion of this seminar.
Sangeetha Mathew
Abstract
Deep learning is a new area of machine learning research that imitates the
way the human brain works. It has a great number of successful applications
in speech recognition, image classification, and natural language processing.
It is a particular approach to building and training neural networks. A deep neural
network consists of a hierarchy of layers, whereby each layer transforms the
input data into more abstract representations. Deep networks have been
successfully applied to unsupervised and supervised feature learning for single
modalities like text, images or audio. As the developments in technology,
an application of deep networks to learn features over multiple modalities
has surfaced. It involves relating information from multiple sources. The
relevance of multi-modality has enhanced tremendously due to extensive use
of social media and online advertising. Social media has been a convenient
platform for voicing opinions from posting messages to uploading a media
fille, or any combination of messages.There are a number of methods that
can be used for multimodal deep learning, but the most efficient one is the Deep
Boltzmann Machine (DBM). The DBM is a fully generative model which can
be utilized for extracting features from data with certain missing modalities.
The DBM is constructed by stacking one Gaussian RBM and one standard binary
RBM. An RBM has three components: a visible layer, a hidden layer, and a
weight matrix containing the weights of the connections between visible and
hidden units. There are no connections between the visible units or between
the hidden units. That is the reason why this model is called restricted.
Contents
1 Introduction
2 Literature Survey
3 Traditional Models
4 Architectures
5 Algorithms
6 Methodology
7 Conclusion
List of Figures
3.1 Audio-Visual User Recognition Systems proposed
4.1 Comparison of model structures
6.1 RBM
6.2 Left: a three-layer Deep Belief Network and a three-layer Deep Boltzmann Machine. Right: pretraining consists of learning a stack of modified RBMs
6.3 Multimodal DBM
6.4 Multimodal learning setting
Chapter 1
Introduction
Multimodal sensing and processing have shown promising results in detection, recognition and identification in various applications, such as human-computer interaction, surveillance, medical diagnosis, biometrics, etc. There
are many ways to generate multiple modalities; one is via sensor diversity
(especially in our everyday life tasks) and the other is via feature diversity
(using engineered and/or learned features).
In the last few decades, many machine learning models have been proposed to deal with multimodal data. Here I focus mostly on deep learning
models for multimodal data. The group of multimodal deep learning approaches discussed here is based on Restricted Boltzmann Machines (RBMs). These
methods include Deep Belief Networks (DBNs), Deep Boltzmann Machines (DBMs) and Deep Autoencoders. Not only are the building blocks of all these
models RBMs, but their training algorithms are also very similar. RBMs, the models considered here, are a group of undirected
probabilistic energy-based graphical models that assign a scalar energy value
to each variable configuration. These models are trained in a way that
plausible configurations are associated with lower energies (higher probabilities). An RBM has three components: a visible layer, a hidden layer, and a
weight matrix containing the weights of the connections between visible and
hidden units. There are no connections between the visible units or between
the hidden units. That is the reason why this model is called "restricted".
In this seminar, these methods have been studied and compared to
more traditional classification approaches such as SVMs and LDA. For that
reason, before getting into deep learning models, we first briefly introduce
SVM and LDA and mention a few of their applications in processing and
classifying multimodal data.
Chapter 2
Literature Survey
Machine Learning: take data, train a model on the data and use the model to make predictions.
Feature Engineering: the art of extracting useful patterns from data that will make it easier for machine learning models to distinguish between classes.
Feature Learning: algorithms that automatically discover the patterns that are important for distinguishing between classes and extract them for use in a classification or regression process.
Deep Learning: the term deep learning originated from new methods
and strategies designed to generate deep hierarchies of non-linear features
by overcoming the problems with vanishing gradients, so that we can train
architectures with dozens of layers of non-linear hierarchical features.
Affective computing has been extensively studied in the last decades, and many
methods have been proposed for handling various media types including textual
documents, images, music and movies. Two widely investigated tasks are
emotion detection and sentiment analysis. Both of them are standard classification problems with different state spaces. Usually emotion detection is
defined on several discrete emotions, such as anger, sadness and joy, while
sentiment analysis aims at categorizing data into positive or negative.
Since the adopted techniques of these two tasks are quite similar, we
will not differentiate them in this section. Previous efforts are summarized
mainly based on the modality of the data they are working on.
For textual data, a lexicon-based approach using a set of pre-defined emotional words or icons has proved to be an effective way. Researchers propose to predict the sentiment of tweets by using emoticons (e.g., the positive
emoticon ":)" and the negative one ":(") and acronyms (e.g., lol (laugh out
loud), gr8 (great) and rotf (rolling on the floor)). A partial tree kernel
is adopted to combine the emoticons, acronyms and Part-of-Speech (POS)
tags. Three lexicon emotion dictionaries and POS tags are leveraged to extract linguistic features from the textual documents. A semantic feature
is proposed to address the sparsity of microblogs. Entities that do not appear are inferred using a pre-defined hierarchical entity structure. For example, "iPad" and "iPhone" indicate the appearance of "Product/Apple".
Furthermore, the latent sentiment topics are extracted and the associated
sentiment tweets are used to augment the original feature space. A set of sentimental aspects, such as opinion strength, emotion and polarity indicators,
are combined as meta-level features for boosting the sentiment classification
on Twitter messages.
Affective analysis of images adopts a framework similar to general concept detection. In SentiBank, a set of visual concept classifiers, which are
strongly related to emotions and sentiments, are trained based on unlabeled
Web images. Then, an SVM classifier is built upon the output scores of these
concept classifiers. The performance of SentiBank has recently been improved by
using deep convolutional neural networks (CNNs). Nevertheless, the utility of
SentiBank is limited by the number and kind of concepts (or ANPs). Because
ANPs are visually emotional concepts, selecting the right samples for classifier training can be subjective. In addition to the semantic-level
features, a set of low-level features, such as color histograms and visual aesthetics, are also adopted. The combined features are then fed into
a multi-task regression model for emotion prediction. Hand-crafted features
derived from principles of art, such as balance and harmony, are proposed for
recognition of image emotion. The deep CNN is also used directly for training
sentiment classifiers, rather than using a mid-level representation consisting of general
concepts. Since Web images are weakly labeled, the system progressively
selects a subset of the training instances with relatively distinct sentiment
labels to reduce the impact of noisy training instances.
For emotional analysis of music, various hand-crafted features corresponding to different aspects of music (e.g., melody, timbre and rhythm) are proposed. In [19], the early-fused features are characterized by a cosine radial
basis function (RBF). A ListNet layer is added on top of the RBF layer for ranking
the music in valence and arousal in Cartesian coordinates. Besides hand-crafted features, the authors adopt deep belief networks (DBNs) on the Discrete Fourier Transforms (DFTs) of music signals. Then, SVM classifiers are
trained on the latent features from the hidden layers.
In the video domain, most research efforts are dedicated to movies. A
large emotional dataset, which contains about 9,800 movie clips, is constructed. SVM classifiers are trained on different low-level features, such
as audio features, complexity and color harmony. Then, late fusion is employed to combine the classifiers. A set of features is proposed based on
psychology and cinematography for affective understanding in movies. Early
fusion is adopted to combine the extracted features. Other fusion strategies
on auditory and visual modalities are studied, and a hierarchical architecture is
proposed for predicting both emotion intensity and emotion types.
A CRF is adopted to model the temporal information in the video sequence. In
addition to movies, a large-scale Web video dataset for emotion analysis has
recently been proposed, where a simplified multi-kernel SVM is adopted to combine the features from different modalities.
Different from those works, the approach studied in this seminar is a fully
generative model, which defines a joint representation for the various features
extracted from different modalities. More importantly, the joint representation
conveying information from multiple modalities can still be generated when
some modalities are missing, which means that the model is not restricted
to particular media types of user-generated content.
Chapter 3
Traditional Models
SVM and LDA for Multimodal Data
Several groups of researchers have proposed SVM- and LDA-based approaches to
multimodal classification and data fusion. The authors of [???] claim
that existing multi-biometric fusion techniques face a number of limitations
since they are based on the assumptions that each biometric modality is
local, complete, and static. These limitations are particularly pronounced
when considered in the context of biometric identification, as opposed to
verification. Key limitations include:
1. Each registered person must be entered into every modality. This may not
be plausible and is very restrictive. Moreover, this makes adding additional
modalities to an existing system difficult or impossible.
2. All of the classifiers must always be available. This will not be the case
if the modalities are part of a distributed system. In addition, multi-biometric
fusion may degrade as individuals are later added to or removed
from the system.
3. Limited to verification. Due to the other limitations listed above, most
existing fusion techniques are explicitly designed for verification only – identification is not supported.
They propose a novel multi-biometric fusion technique that addresses the
issues listed above and is suitable for both identification and verification. A
mediator agent controls the fusion of the individual biometric match scores,
using a “bank” of SVMs that cover all possible subsets of the biometric
modalities being considered. This agent selects an appropriate SVM for fusion, based on which modality classifiers are currently available and have
sensor data for the identity in question. This fusion technique differs from a
traditional SVM ensemble – rather than combining the output of all of the
SVMs, they apply only the SVM that best corresponds to the available modalities. The mediator agent also controls the learning of new SVMs when
modalities are added to the system or sufficient changes have been made to
the data in existing modalities. The experiments utilize the following biometric modalities: face, fingerprint and DNA profile data. The authors empirically
show that their multiple-SVM technique produces more accurate results than
the traditional single-SVM approach. The pipeline of this approach is shown
in the figure below.
Figure 3.1: Audio-Visual User Recognition Systems proposed
Comparisons and Discussions
Both Linear Discriminant Analysis and Support Vector Machines compute
hyperplanes that are optimal with respect to their individual objectives.
However, there can be vast differences in performance between the two techniques depending on the extent to which their respective assumptions agree
with the problem at hand.
While LDA and a linear SVM share much in common (both compute a separating hyperplane),
the technical differences between the two are significant. As a very informal
explanation, LDA always draws a linear boundary, while an SVM can be made nonlinear with kernels and draw curves
instead. Also, since the SVM can be seen as a generalization of linear classifiers such as LDA,
it is generally the more flexible and sophisticated approach.
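To make the comparison concrete, the sketch below trains LDA and a linear SVM on the same synthetic feature matrix, which stands in for concatenated multimodal descriptors; scikit-learn, the array shapes and the toy labels are illustrative assumptions, not part of the surveyed work.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
n_samples, d_audio, d_visual = 600, 40, 60
X = rng.randn(n_samples, d_audio + d_visual)     # concatenated audio+visual feature vectors
y = (X[:, :5].sum(axis=1) > 0).astype(int)       # toy binary labels for illustration

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
svm = LinearSVC(C=1.0).fit(X_tr, y_tr)

print("LDA accuracy:", lda.score(X_te, y_te))
print("Linear SVM accuracy:", svm.score(X_te, y_te))

Which of the two performs better on real multimodal data depends on how well their respective assumptions match that data, as discussed above.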
Chapter 4
Architectures
There are a number of deep learning models for multimodal sensing and
processing. The first group of multimodal deep learning approaches that
we study are all based on Restricted Boltzmann Machines (RBMs). These
methods include Deep Belief Networks (DBNs), Deep Boltzmann Machines
(DBMs) and Deep Autoencoders. Not only are the building blocks of all these
models RBMs, but their training algorithms are also very similar.
4.1 Boltzmann Machine
A Boltzmann Machine is a network of symmetrically connected, neuron-like units that take stochastic decisions about whether to turn on or off.
These neurons include both hidden and visible units, and their activation is
governed by an energy function. Boltzmann Machines are one of the first examples of a neural
network capable of learning internal representations, and they are able to represent
and solve difficult combinatorial problems. However, they suffer from a number of
issues: the machine seems to stop learning as it is scaled up, owing
to time requirements that grow exponentially with the machine size and the number
of connections between neurons, and noise causes the connection strengths
to randomize. Consequently, Boltzmann Machines with unrestricted connectivity are
not much used in machine learning. Boltzmann Machines with
restrictions on connectivity between neurons, however, are useful in the field of ML,
since they are free of the issues just discussed.
4.2 Restricted Boltzmann Machine (RBM)
An RBM is a kind of Boltzmann Machine whose connections are restricted
by the simple rule that its neurons must form a bipartite graph, i.e. its
neurons can be divided into two disjoint sets. RBMs perform a kind of factor analysis on input data, extracting a smaller set of hidden variables that
can be used as a data representation. An RBM differs from other representation-learning
algorithms in machine learning (ML) in two respects: it is stochastic and it is generative.
Being stochastic means its neuron values are computed from a probability distribution. Being generative means it can generate data on its own after
learning.
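For reference, the energy function of a binary RBM with visible units v, hidden units h, biases a and b, and weight matrix W can be written in the standard textbook form (a standard formulation, not an equation reproduced from the surveyed papers):

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j, \qquad P(v, h) = \frac{e^{-E(v,h)}}{Z},

where Z is the partition function. Because of the bipartite structure, the conditional distributions factorize:

P(h_j = 1 \mid v) = \sigma\big(b_j + \sum_i W_{ij} v_i\big), \qquad P(v_i = 1 \mid h) = \sigma\big(a_i + \sum_j W_{ij} h_j\big),

with \sigma the logistic sigmoid. Training lowers the energy of configurations that resemble the data, which is exactly the sense in which plausible configurations receive higher probability.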
Figure 4.1: Comparison of model structures
4.3 Deep Boltzmann Machine (DBM)
A DBM is a deep, multi-layer Boltzmann Machine with restricted connectivity. DBMs
have the potential of learning internal representations that become increasingly complex, which is considered to be a promising way of solving object
and speech recognition problems. They can be trained on a large
supply of unlabeled sensory inputs, and very limited labeled data can then be
used to only slightly fine-tune the model for a specific task at hand. DBMs
also handle ambiguous inputs more robustly, since they incorporate
top-down feedback in the training procedure. As we can see from the figure, a
DBM may consist of several RBMs connected together.
4.4 Comparison
A naive approach for multimodal deep learning is to concatenate the data descriptors from different input sources to construct a single high-dimensional
feature vector and use it to solve a unimodal representation learning problem. However, the correlation between features in each data modality is much
stronger than that between data modalities. As a result, the learning algorithms are easily tempted to learn dominant patterns in each data modality
separately while giving up learning patterns that occur simultaneously in
multiple data modalities.
To resolve this issue, deep learning methods, such as deep autoencoders
or deep Boltzmann machines (DBM), have been adapted, where the common strategy is to learn joint representations that are shared across multiple
modalities at the higher layer of the deep network, after learning layers of
modality-specific networks. The rationale is that the learned features may
have less within-modality correlation than raw features, and this makes it
easier to capture patterns across data modalities. This has shown promise,
but there still remains the challenging question of how to learn associations
between multiple heterogeneous data modalities so that we can effectively
deal with missing data modalities at testing time.
One necessary condition for a good generative model of multimodal data
is the ability to predict or reason about missing data modalities given partial observation. Honglak Lee's research group at the University of Michigan
has proposed a new approach to satisfy this condition and improve multimodal deep learning. Their emphasis is on efficiently learning associations
between heterogeneous data modalities. According to their study, the data
from multiple sources are semantically correlated and provide complementary information about each other, and a good multimodal model must be
able to generate a missing data modality given the rest of the modalities.
They propose a novel learning framework that explicitly aims at this goal by
training the model to minimize the Variation of Information (VI) instead of
maximizing the likelihood.
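As a point of reference, one common way of writing this objective (an assumption on my part based on standard formulations, not an equation quoted from that group's paper) is the negative sum of conditional log-likelihoods over two modalities x and y:

L_{VI}(\theta) = -\,\mathbb{E}_{P_D(x,y)}\big[\log P_\theta(x \mid y) + \log P_\theta(y \mid x)\big],

so that minimizing L_{VI} directly rewards the model for being able to predict each modality from the other, which is precisely the ability needed when a modality is missing at test time.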
Chapter 5
Algorithms
5.1 Contrastive Divergence
In the surveyed work, the learning process of the proposed model is split into two
phases. In the first phase, each RBM component of the proposed multimodal
DBM is pre-trained by using the greedy layer-wise pretraining strategy. Under this scheme, we train each layer with a set of different parameters
and choose the best-performing parameter set for the model. For this, a
contrastive divergence (CD) algorithm is utilized, since the time complexity
of exact computation increases rapidly with the number of neurons. The 1-step contrastive
divergence (CD-1) algorithm is widely used for RBM training to perform approximate learning of the parameters. CD allows us to approximate
the gradient of the log-likelihood; the approximation is based
on a Markov chain.
In the CD-1 algorithm, a Markov chain is run for one full step and then the parameters are modified to reduce the tendency of the chain to wander away from the
initial distribution. This reduces the time and computational effort, since we
do not wait for the chain to run to its equilibrium state before comparing the initial
and final distributions. The distribution generated by the Markov chain can be
thought of as approximating the distribution defined by the RBM, since
both are governed by the same energy function. The Markov chain is run for only one step
because the result of that single step already gives us the direction of
change of the parameters (the gradient).
CD-1 actually performs poorly in approximating the size of the change in
parameters. However, it is accurate enough for learning an RBM that provides
hidden features for training a higher-level RBM. This is because CD-1 retains
most of the information about the inputs, as it involves only single-step calculations.
The greedy layer-by-layer pretraining algorithm relies on learning a stack of
RBMs with a small modification. The key intuition is that for the lower-level RBM to compensate for the lack of top-down input into h1, the input
must be doubled, with the copies of the visible-to-hidden connections tied.
Conversely, for the top-level RBM to compensate for the lack of bottom-up
input into h2, the number of hidden units is doubled. For the intermediate
layers, the RBM weights are simply doubled. The stack of RBMs can then
be trained in a greedy layer-by-layer fashion using the CD algorithm.
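To make the CD-1 update concrete, the following is a minimal NumPy sketch of a single CD-1 parameter update for a binary RBM. It is an illustrative implementation of the standard algorithm, not code taken from the surveyed work; the variable names, learning rate and batch convention are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01, rng=np.random):
    # Positive phase: hidden probabilities and samples given the data batch v0.
    p_h0 = sigmoid(b + v0 @ W)                     # P(h = 1 | v0)
    h0 = (rng.rand(*p_h0.shape) < p_h0).astype(float)
    # One full Gibbs step: reconstruct the visibles, then recompute the hiddens.
    p_v1 = sigmoid(a + h0 @ W.T)                   # P(v = 1 | h0)
    v1 = (rng.rand(*p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b + v1 @ W)                     # P(h = 1 | v1)
    # Approximate gradient: data statistics minus one-step reconstruction statistics.
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / batch
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (p_h0 - p_h1).mean(axis=0)
    return W, a, b

Here v0 is a batch of binary visible vectors (one row per example), W the visible-to-hidden weight matrix, and a, b the visible and hidden biases.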
5.2 Greedy layerwise pretraining strategy
Greedy layer-wise supervised training: a reasonable question to ask is whether
the fact that each layer is trained in an unsupervised way is critical or not.
An alternative algorithm is supervised, greedy and layer-wise: train each
new hidden layer as the hidden layer of a one-hidden-layer supervised neural
network NN (taking as input the output of the last of the previously trained
layers), and then throw away the output layer of NN and use the parameters
of the hidden layer of NN as the pre-training initialization of the new top layer of
the deep net, to map the output of the previous layers to a hopefully better
representation. (Pseudo-code for this procedure, which trains each layer as the
hidden layer of a supervised one-hidden-layer neural network, is given in the original work.)
During each phase of the greedy unsupervised training strategy, layers are
trained to represent the dominant factors of variation extant in the data.
This has the effect of leveraging knowledge of X to form, at each layer, a
representation of X consisting of statistically reliable features of X that can
then be used to predict the output (usually a class label) Y. This perspective
places unsupervised pre-training well within the family of learning strategies
collectively known as semi-supervised methods. As with other recent work
demonstrating the effectiveness of semi-supervised methods in regularizing
model parameters, we claim that the effectiveness of the unsupervised pretraining strategy is limited to the extent that learning P(X) is helpful in learning P(Y|X). Here, we find transformations of X (learned features) that are
predictive of the main factors of variation in P(X), and when the pre-training
strategy is effective, some of these learned features of X are also predictive
of Y. In the context of deep learning, the greedy unsupervised strategy may
also have a special function. To some degree it resolves the problem of simultaneously learning the parameters at all layers by introducing a proxy
criterion. This proxy criterion encourages significant factors of variation,
present in the input data, to be represented in intermediate layers.
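The greedy layer-wise idea itself fits in a few lines: train the first RBM on the raw input, then treat its hidden activations as the training data for the next RBM, and so on. The sketch below reuses the cd1_update and sigmoid functions from the previous section; the layer sizes, epoch count and learning rate are arbitrary illustrative choices rather than values from the surveyed work.

import numpy as np

def pretrain_stack(data, layer_sizes, epochs=10, lr=0.01):
    rng = np.random.RandomState(0)
    weights = []
    layer_input = data
    for n_hidden in layer_sizes:
        n_visible = layer_input.shape[1]
        W = 0.01 * rng.randn(n_visible, n_hidden)
        a = np.zeros(n_visible)
        b = np.zeros(n_hidden)
        for _ in range(epochs):
            W, a, b = cd1_update(layer_input, W, a, b, lr=lr, rng=rng)
        weights.append((W, a, b))
        # The hidden activations of this layer become the "data" for the next layer.
        layer_input = sigmoid(b + layer_input @ W)
    return weights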
Chapter 6
Methodology
The learning of the proposed model is not trivial due to the multiple layers
of hidden units and multiple modalities. Here the methodology is to split the
learning process into two stages. First, each RBM component of the proposed
multimodal DBM is pretrained by using the greedy layer-wise pretraining
strategy. In this stage, the time cost for exactly computing the derivatives
of the probability distributions with respect to the parameters increases exponentially with the number of units in the network. Thus, 1-step
contrastive divergence, an approximate learning method, is adopted. In the second stage,
the missing modalities are inferred by alternating Gibbs sampling. Meanwhile,
the joint representation is updated with the generated data of the missing modalities.
Figure 6.1: RBM
The proposed network architecture, described in detail in Section 6.1, is composed of
three different pathways, respectively for the visual, auditory and textual modalities,
each formed by stacking multiple Restricted Boltzmann Machines (RBMs). Every RBM tries to optimize its energy function
in order to maximize the probability of the training data.
DBNs can be trained using the CD algorithm to extract a deep hierarchical
representation of the training data. During the learning process, the DBN is
first trained one layer at a time, in a greedy unsupervised manner, by treating the values of hidden units in each layer as the training data for the next
layer (except for the first layer, which is fed with the raw input data). This
learning procedure, called pre-training, finds a set of weights that determine
how the variables in one layer depend on the variables in the layer above.
These parameters capture the structural properties of the training data. If
the network is to be used for a classification task, then a supervised discriminative fine-tuning is performed by adding an extra layer of output units and
back-propagating the error derivatives (using some form of stochastic gradient descent, or SGD).
Figure 6.2: Left: a three-layer Deep Belief Network and a three-layer Deep
Boltzmann Machine. Right: pretraining consists of learning a stack of modified
RBMs
To generate a sample from the DBN, we need to perform Gibbs sampling
for a long time between the top two layers h1 and h2 until we converge to
a sample of the h2 layer, then traverse the rest of the DBN in a top-down
manner using the conditional probability distributions to generate the desired sample at the visible layer.
Erhan et al. (2009) studies the reasons why pre-trained deep networks work
much better than traditional neural networks and proposes several possible
explanations. One possible explanation is that pre-training initializes the
parameters of the network in an area of parameter space where optimization
is easier and better local optima are found. This is equivalent to penalizing
solutions that are outside a particular region of the solution space. Another
explanation is that pre-training acts as a kind of regularizer that minimizes
the variance and introduces a bias towards configurations of the parameters
that stochastic gradient descent can explore during the supervised learning phase, by defining a data-dependent prior on the parameters obtained
through the unsupervised learning. In other words, pre-training implicitly
imposes constraints on the parameters of the network to specify which minimum, out of all local minima of the objective function, is desired. The effect
of pre-training relies on the assumption that the true target conditional distribution shares structure with the input distribution.
6.1 Multimodal DBM
Figure 6.3: Multimodal DBM
Figure above shows the proposed network architecture, which is composed of
three different pathways respectively for visual, auditory and textual modalities. Each pathway is formed by stacking multiple Restricted Boltzmann Machines (RBM), aiming to learn several layers of increasingly complex representations of individual modality. Similar to [23], we adopt Deep Boltzmann
Machine (DBM) in our multimodal learning framework. Different from other
deep networks for extracting feature, such as Deep Belief Networks (DBN)
[24] and denoising Autoencoders (dA) [25], DBM is a fully generative model
which can be utilized for extracting features from data with certain missing
modalities. Additionally, besides the bottom-up information propagation in
DBN and dA, a top-down feedback is also incorporated in DBM, which makes
the DBM more stable on missing or noisy inputs such as weakly labeled data
on the Web. The pathways eventually meet and the sophisticated non-linear
relationships among the three modalities are jointly learned. The final joint representation can be viewed as a shared embedded space, where features
with very different statistical properties from different modalities can be represented in a unified way.
Visual Pathway: The visual input consists of five complementary low-level
features widely used in previous works. As shown in the figure, each feature
is modeled with a separate two-layer DBM pathway. The five
features are DenseSIFT, GIST, HOG, LBP and SSIM.
Auditory Pathway: The input features adopted in the auditory pathway are
MFCC and Audio-Six (i.e., Energy Entropy, Signal Energy, Zero Crossing
Rate, Spectral Rolloff, Spectral Centroid, and Spectral Flux). The Audio-Six descriptor, which can capture different aspects of an audio signal, is
expected to be complementary to the MFCC. Since the dimension of Audio-Six is only six, we directly concatenate the MFCC feature with Audio-Six
rather than separating them into two sub-pathways as in the design of the visual
pathway. The correlation between these two features can be learned by the
deep architecture of the DBM. The auditory input is a real-valued feature vector, and
h1 and h2 denote the first and second hidden layers
respectively. The DBM is constructed by stacking one Gaussian RBM and
one standard binary RBM.
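For completeness, the Gaussian RBM at the bottom of this pathway replaces the binary visible units with real-valued ones. In the standard formulation with unit-variance visible units (again, the textbook form rather than an equation copied from the paper), the conditionals become

P(h_j = 1 \mid v) = \sigma\big(b_j + \sum_i W_{ij} v_i\big), \qquad P(v_i \mid h) = \mathcal{N}\big(a_i + \sum_j W_{ij} h_j,\; 1\big),

so the visible units are Gaussian given the hidden units, while the RBM stacked on top of it is a standard binary RBM.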
Textual Pathway: Different from the visual and auditory modalities, the inputs of the textual pathway are discrete values (i.e., word counts). Thus,
we use a Replicated Softmax model to model the distribution over the word-count
vectors. The visible units represent the associated metadata (i.e., title and
description) of a video, with each visible unit denoting the count of the corresponding word in a pre-defined
dictionary.
6.2 Modeling Tasks
Generating Missing Modalities: As argued in the introduction, many
real-world applications will often have one or more modalities missing. The
multimodal DBM can be used to generate such missing data modalities by
clamping the observed modalities at the inputs and sampling the missing
modalities from the conditional distribution by running the standard alternating Gibbs sampler (a sketch of this procedure is given at the end of this section).
Inferring Joint Representations: The model can also be used to generate
a fused representation of multiple data modalities. This fused representation is inferred by clamping the observed modalities and running alternating
Gibbs sampling to sample the top-level joint layer conditioned on both modalities (if both are present) or
on the observed modality alone (if text is missing).
This representation can then be used to do information retrieval for multimodal or unimodal queries. Each data point in the database (whether
missing some modalities or not) can be mapped to this latent space. Queries
can also be mapped to this space and an appropriate distance metric can be
used to retrieve results that are close to the query.
Discriminative Tasks: Classifiers such as SVMs can be trained with these
fused representations as inputs. Alternatively, the model can be used to
initialize a feed-forward network which can then be fine-tuned. In our experiments, logistic regression was used to classify the fused representations.
Unlike fine-tuning, this ensures that all learned representations that we compare (DBNs, DBMs and Deep Autoencoders) use the same discriminative
model.
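As referenced above, the generation of a missing modality can be sketched as a short alternating Gibbs loop with the observed modality clamped. The helper methods sample_hidden_given and sample_text_given are hypothetical placeholders standing in for the trained model's conditional distributions; this is an illustration of the procedure, not the authors' code.

import numpy as np

def infer_missing_text(image_v, model, n_steps=200, rng=np.random):
    # Start the missing (text) modality from a random configuration.
    text_v = (rng.rand(model.n_text_visible) < 0.5).astype(float)
    for _ in range(n_steps):
        # Sample the joint hidden layers given both visible layers,
        # keeping the observed image modality clamped to its data values.
        h = model.sample_hidden_given(image_v, text_v, rng)
        # Resample only the missing modality from its conditional distribution.
        text_v = model.sample_text_given(h, rng)
    # A single sample of the missing modality; in practice one may average several samples.
    return text_v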
6.3 Classification Tasks
Multimodal Inputs: Our first set of experiments evaluates the DBM as a
discriminative model for multimodal data. For each model that we trained,
the fused representation of the data was extracted and fed to a separate
logistic regression for each of the 38 topics. The text input layer in the
DBM was left unclamped when the text was missing. Fig. 4 summarizes the
Mean Average Precision (MAP) and precision@50 (precision at top 50 predictions) obtained by different models. Linear Discriminant Analysis (LDA)
and Support Vector Machines (SVMs) [2] were trained using the labeled data
on concatenated image and text features that did not include SIFT-based
features. Hence, to make a fair comparison, our model was first trained
using only labeled data with a similar set of features (i.e., excluding our
SIFT-based features). We call this model DBM-Lab. Fig. 4 shows that the
DBM-Lab model already outperforms its competitor SVM and LDA models.
DBM-Lab achieves a MAP of 0.526, compared to 0.475 and 0.492 achieved
by the SVM and LDA models. To measure the effect of using unlabeled data, a
DBM was trained using all the unlabeled examples that had both modalities
present. We call this model DBM-Unlab. The only difference between the
DBM-Unlab and DBM-Lab models is that DBM-Unlab used unlabeled data
during its pretraining stage. The input features for both models remained the
same. Not surprisingly, the DBM-Unlab model significantly improved upon
DBM-Lab achieving a MAP of 0.585. Our third model, DBM, was trained
using additional SIFT-based features. Adding these features improves the
MAP to 0.609. We compared our model to two other deep learning models:
Multimodal Deep Belief Network (DBN) and a deep Autoencoder model.
These models were trained with the same number of layers and hidden units
as the DBM. The DBN achieves a MAP of 0.599 and the autoencoder gets
0.600. Their performance was comparable but slightly worse than that of the
DBM. In terms of precision@50, the autoencoder performs marginally better
than the rest. We also note that the Multiple Kernel Learning approach
proposed by Guillaumin et al. achieves a MAP of 0.623 on the same dataset.
However, they used a much larger set of image features (37,152 dimensions).
Unimodal Inputs: Next, we evaluate the ability of the model to improve
classification of unimodal inputs by filling in other modalities. For multimodal models, the text input was only used during training. At test time,
all models were given only image inputs.
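To make this evaluation protocol concrete, the sketch below trains one independent logistic regression per topic on the fused representations and reports the mean average precision. It uses scikit-learn as a stand-in for the authors' implementation; the array names and shapes (fused representations as rows, binary topic indicators as columns) are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

def evaluate_topics(fused_train, labels_train, fused_test, labels_test):
    # labels_* are binary indicator matrices of shape (n_samples, n_topics).
    ap_scores = []
    for topic in range(labels_train.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(fused_train, labels_train[:, topic])
        scores = clf.predict_proba(fused_test)[:, 1]
        ap_scores.append(average_precision_score(labels_test[:, topic], scores))
    # Mean average precision over all topics.
    return float(np.mean(ap_scores))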
6.4 Retrieval Tasks
Multimodal Queries: The next set of experiments was designed to evaluate the quality of the learned joint representations. A database of images
was created by randomly selecting 5000 image-text pairs from the test set.
We also randomly selected a disjoint set of 1000 images to be used as queries.
Each query contained both image and text modalities. Binary relevance labels were created by assuming that if any of the 38 class labels overlapped
between a query and a data point, then that data point is relevant to the
query. Fig. 5a shows the precision-recall curves for the DBM, DBN, and Autoencoder models (averaged over all queries). For each model, all queries and
all points in the database were mapped to the joint hidden representation
under that model. The cosine similarity function was used to match queries to
data points. The DBM model performs the best among the compared models
achieving a MAP of 0.622. The autoencoder and DBN models perform worse
with a MAP of 0.612 and 0.609 respectively. Note that even though there is
little overlap in terms of text, the model is able to perform well.
Unimodal Queries: The DBM model can also be used to query with unimodal inputs by filling in the missing modality. Fig. 5b shows the precision-recall curves for the DBM model along with other unimodal models, where
each model received the same image queries as input. By effectively inferring
the missing text, the DBM model was able to achieve far better results than
any unimodal method (a MAP of 0.614, as compared to 0.587 for an Image-DBM and 0.578 for an Image-DBN).
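Once every query and database item has been mapped to the joint space, the retrieval step itself reduces to a cosine-similarity ranking. A minimal sketch (the array names are assumptions; rows are joint-layer representations) is:

import numpy as np

def retrieve(query_reps, db_reps, top_k=50):
    # Normalize rows so that a dot product equals the cosine similarity.
    q = query_reps / np.linalg.norm(query_reps, axis=1, keepdims=True)
    d = db_reps / np.linalg.norm(db_reps, axis=1, keepdims=True)
    sims = q @ d.T                                  # (n_queries, n_db) cosine similarities
    ranked = np.argsort(-sims, axis=1)[:, :top_k]   # indices of the top-k database items per query
    return ranked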
6.5 Multimodal learning setting
We will consider the learning settings shown in the figure below. The overall task
can be divided into three phases – feature learning, supervised training, and
testing. We keep the supervised training and testing phases fixed and examine different feature learning models with multimodal data. In detail, we
consider three learning settings – multimodal fusion, cross modality learning,
and shared representation learning.
Figure 6.4: Multimodal learning setting
For the multimodal fusion setting, data from all modalities is available at
all phases; this represents the typical setting considered in most prior work
in audio-visual speech recognition [3]. In cross modality learning, one has
access to data from multiple modalities only during feature learning. During
the supervised training and testing phase, only data from a single modality
is provided. In this setting, the aim is to learn better single modality representations given unlabeled data from multiple modalities. Last, we consider
a shared representation learning setting, which is unique in that different
modalities are presented for supervised training and testing. This setting
allows us to evaluate if the feature representations can capture correlations
across different modalities. Specifically, studying this setting allows us to
assess whether the learned representations are modality-invariant.
6.6 Datasets and Task
Since only unlabeled data was required for unsupervised feature learning, we
combined diverse datasets to learn features. We used all the datasets for
feature learning. AVLetters and CUAVE were further used for supervised
classification. We ensured that no test data was used for unsupervised feature learning.
CUAVE: 36 individuals saying the digits 0 to 9. We used the normal
portion of the dataset, where each speaker was frontal facing and spoke each
digit 5 times. We evaluated digit classification on the CUAVE dataset in
a speaker-independent setting. As there has not been a fixed protocol for
evaluation on this dataset, we chose to use the odd-numbered speakers for the
test set and the even-numbered ones for the training set.
AVLetters: 10 speakers saying the letters A to Z, three times each. The
dataset provides pre-extracted lip regions at 60x80 pixels. As we were not
able to obtain the raw audio information for this dataset, we used it for evaluation on a visual-only lipreading task. We report results on the third test
setting used in prior work for comparisons.
AVLetters2: 5 speakers saying the letters A to Z, seven times each.
This is a new high definition version of the AVLetters dataset. We used this
dataset for unsupervised training only.
Stanford Dataset: 23 volunteers spoke the digits 0 to 9, letters A to Z
and selected sentences from the TIMIT dataset. We collected this data in
a similar fashion to the CUAVE dataset and used it for unsupervised training
only.
TIMIT: We used the TIMIT dataset for unsupervised audio feature pretraining.
We note that in all datasets there is variability in the lips in terms of appearance, orientation and size.
Our features were evaluated on speech classification of isolated letters
and digits. We extracted features from overlapping windows. Since examples had varying durations, we divided each example into S equal slices and
performed average-pooling over each slice. The features from all slices were
subsequently concatenated together. We combined features using S = 1 and
S = 3 to form our final feature representation for classification using a linear
SVM.
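The slicing-and-pooling step described above can be written down directly; in the sketch below, frames is a matrix of per-window features for one utterance, and the concatenation of the S = 1 and S = 3 poolings follows the text (the function names are illustrative).

import numpy as np

def pool_features(frames, S):
    # frames: (n_frames, feature_dim) array of features from overlapping windows.
    slices = np.array_split(frames, S, axis=0)
    return np.concatenate([s.mean(axis=0) for s in slices])

def final_representation(frames):
    # Combine the S = 1 and S = 3 poolings into one vector for the linear SVM.
    return np.concatenate([pool_features(frames, 1), pool_features(frames, 3)])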
6.7 Cross Modality Learning
We first evaluate the learned features in a setting where unlabeled data for
both modalities are available during feature learning, while during supervised
training and testing phases only a single modality is presented. In these experiments, we evaluate cross modality learning where one learns better representations for one modality (e.g., video) when given multiple modalities (e.g.,
audio and video) during feature learning. For the bimodal deep autoencoder,
we set the value of the other modality to zero when computing the shared
representation which is consistent with the feature learning phase. All deep
autoencoder models are trained with all available unlabeled audio and video
data.
On the AVLetters dataset, there is an improvement over hand-engineered
features from prior work. The deep autoencoder models performed the best
on the dataset, obtaining a classification score of 65.8%, outperforming the
best previously published results.
On the CUAVE dataset (Table 1b), there is an improvement from learning video features with both video and audio compared to learning features with
only video data. The deep autoencoder models ultimately perform the best,
obtaining a classification score of 69.7%. In our model, we chose to use a
very simple front-end that only extracts bounding boxes (without any correction for orientation or perspective changes). A more sophisticated visual
front-end in conjunction with our models has the potential to do even better.
The video classification results show that the deep autoencoder model
achieves cross modality learning by discovering better video representations
when given additional audio data. In particular, even though the AVLetters
dataset did not have any audio data, we were able to obtain better performance by learning better video features using other unlabeled data sources
which had both audio and video data.
However, we also note that cross modality learning did not help to learn
better audio features; since our feature learning mechanism is unsupervised,
we find that our model learns features that adapt to the video modality but
are not useful for speech classification.
6.8 Cross-modal retrieval
Nowadays, mobile devices and emerging social websites (e.g., Facebook, Flickr,
YouTube, and Twitter) are changing the ways people interact with the world
and search for information of interest. It is convenient if users can submit any
media content at hand as the query. Suppose we are on a visit to the Great
Wall; by taking a photo, we may expect to use the photo to retrieve
relevant textual materials that serve as guides for us. Therefore, cross-modal
retrieval, as a natural way of searching, becomes increasingly important.
Cross-modal retrieval aims to take one type of data as the query to retrieve relevant data of another type. For example, the text is used as the
query to retrieve images. Furthermore, when users search information by
submitting a query of any media type, they can obtain search results across
various modalities, which is more comprehensive given that different modalities of data can provide complementary information to each other. More
recently, cross-modal retrieval has attracted considerable research attention.
The challenge of cross-modal retrieval is how to measure the content
similarity between different modalities of data, which is referred to as the heterogeneity gap. Hence, compared with traditional retrieval methods, cross-modal retrieval requires cross-modal relationship modeling, so that users can
retrieve what they want by submitting what they have. Now, the main research effort is to design effective ways to make cross-modal retrieval
more accurate and more scalable.
In the cross-modal retrieval procedure, users can search various modalities of data including texts, images and videos, starting with any modality of
data as a query. In the general framework of cross-modal retrieval,
feature extraction for multimodal data is the first step, used to represent the various modalities of data. Based on these representations of multimodal data, cross-modal correlation modeling is performed to learn common
representations for the various modalities. Finally, the common representations enable cross-modal retrieval through suitable solutions for search result
ranking and summarization. In short, multi-modal retrieval uses
both the image and the text to find similar content (maybe also multi-modal
content, i.e. image+text, but maybe just images or just text); that is, we
search for content that somehow matches both the image and the text in the
query. Cross-modal retrieval, in contrast, uses one modality to find similar content in
the other modality, e.g. using the text to find images matching that text
(which would then also match the original image, if the association between
the original image and the original text holds).
Chapter 7
Conclusion
This seminar presented a deep model for learning multimodal signals coupled with emotions and semantics. In particular, the surveyed work proposes a multi-pathway
DBM architecture dealing with low-level features of various types and more
than twenty thousand dimensions, which had not previously been attempted to the best
of the authors' knowledge. The major advantage of this model is in capturing the
non-linear and complex correlations among different modalities in a joint
space. The model has attractive properties: learning is unsupervised, and it
can cope with samples that have missing modalities. Compared with hand-crafted
features, the model generates much more compact features and allows natural cross-modal matching beyond late or early fusion. As demonstrated on
the ImageTweets dataset, the features generated by mapping single-modality
samples (text or visual) into the joint space consistently outperform hand-crafted features in sentiment classification. In addition, the complementarity between deep and hand-crafted features is shown for emotion prediction
on the Video Emotion dataset. Among the eight categories of emotion, nevertheless, the categories 'anticipation' and 'surprise' remain difficult, whether
with learned or hand-tuned features. For video retrieval, the model shows
favorable performance, convincingly outperforming hand-crafted features over
different types of queries. Encouraging results are also obtained when applying the deep features to cross-modal retrieval, which is not possible with
hand-crafted features. Hence, the learning is fully generative and the model
is more expressive.
Bibliography
[1] L. Pang, S. Zhu, and C.-W. Ngo, "Deep multimodal learning for affective analysis and retrieval," IEEE Transactions on Multimedia, vol. 17, no. 11, November 2015.
[2] Y.-G. Jiang, B. Xu, and X. Xue, "Predicting emotions in user-generated videos," in Proc. AAAI, pp. 73–79, 2014.
[3] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau, "Sentiment analysis of Twitter data," in Proc. Workshop on Languages in Social Media, pp. 30–38, 2011.
[4] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[5] T. Chen, D. Borth, T. Darrell, and S.-F. Chang, "DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks," CoRR, 2014.
[6] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, "Large-scale visual sentiment ontology and detectors using adjective noun pairs," in Proc. ACM MM, pp. 223–232, 2013.
[7] J. Sivic and A. Zisserman, "A text retrieval approach to object matching in videos," in Proc. Ninth IEEE International Conference on Computer Vision (ICCV 2003), 2003.
[8] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.