
Advanced Multimodal Machine Learning


Lecture 8.1: Multimodal alignment
Louis-Philippe Morency

* Original version co-developed with Tadas Baltrusaitis

Upcoming Schedule

▪ First project assignment:


▪ Proposal presentation (10/3 and 10/5)
▪ First project report (10/8)
▪ Midterm project assignment
▪ Midterm presentations (Tuesday 11/6 & Thursday
11/8)
▪ Midterm report (Sunday 11/11) – No extensions
▪ Final project assignment
▪ Final presentation (TBD)
▪ Final report (12/11 at 11:59pm ET)
Midterm Presentation Instructions

▪ 7-8 minute presentations (max: 8 mins)


▪ +1.5 minutes for written feedback and notes
▪ All team members should be involved.
▪ The ordering of the presentations (Tuesday vs. Thursday) is the inverse of the proposal order.
▪ The presentations will be from 4:30pm – 6pm
▪ Please arrive on time!
Midterm Presentation Instructions

▪ General definition of your research problem, including a mathematical formalization of the problem. Include definitions of the main variables and the overall objective function (2-3 slides)
▪ Explain at least two multimodal baseline models for your research problem (2-4 slides)
▪ Present current results of these baseline models on your dataset. You should study the failure cases of the baseline models (3-5 slides)
▪ Describe the research directions you are planning to explore. Discuss how they will address some of the shortcomings of your baseline models. (2-3 slides)
Midterm Project Report Instructions

▪ Main sections:
▪ Abstract
▪ Introduction
▪ Related work
▪ Problem statement
▪ Multimodal baseline models
▪ Experimental methodology
▪ Results and discussion
▪ Proposed approaches
Lecture objectives

▪ Multimodal alignment
▪ Implicit
▪ Explicit
▪ Explicit signal alignment
▪ Dynamic Time Warping
▪ Canonical Time Warping
▪ Attention models in deep learning (implicit and
explicit alignment)
▪ Soft attention
▪ Hard attention
▪ Spatial Transformer Networks
Multi-modal alignment
Multimodal alignment

▪ Multimodal alignment – finding relationships and correspondences between two or more modalities
▪ Examples
  ▪ Images with captions
  ▪ Recipe steps with a how-to video
  ▪ Phrases/words of translated sentences
▪ Two types
  ▪ Explicit – alignment is the task in itself
  ▪ Latent – alignment helps when solving a different task (for example “Attention” models)

[Figure: elements t1 … tn of Modality 1 matched by a “fancy algorithm” to elements of Modality 2]
Explicit multimodal alignment

▪ Explicit alignment – the goal is to find correspondences between modalities
  ▪ Aligning a speech signal to a transcript
  ▪ Aligning two out-of-sync sequences
  ▪ Co-referring expressions
Implicit multimodal alignment

▪ Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
  ▪ Machine Translation
  ▪ Cross-modal retrieval
  ▪ Image & Video Captioning
  ▪ Visual Question Answering
Explicit alignment
Temporal sequence alignment

Applications:
- Re-aligning asynchronous data
- Finding similar data across modalities (we can estimate the alignment cost)
- Event reconstruction from multiple sources
Let’s start unimodal – Dynamic Time Warping

▪ We have two unaligned temporal unimodal signals
  ▪ $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{n_x}] \in \mathbb{R}^{d \times n_x}$
  ▪ $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{n_y}] \in \mathbb{R}^{d \times n_y}$
▪ Find a set of indices that minimizes the alignment difference:

  $L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \big\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \big\|_2^2$

▪ where $\mathbf{p}^x$ and $\mathbf{p}^y$ are index vectors of the same length $l$
▪ Finding these indices is called Dynamic Time Warping
Dynamic Time Warping continued

▪ Lowest-cost path in a cost matrix, from $(p_1^x, p_1^y)$ to $(p_l^x, p_l^y)$
▪ Restrictions
  ▪ Monotonicity – no going back in time
  ▪ Continuity – no gaps
  ▪ Boundary conditions – start and end at the same points
  ▪ Warping window – don’t get too far from the diagonal
  ▪ Slope constraint – do not insert or skip too much
Dynamic Time Warping continued

▪ Lowest-cost path in a cost matrix, from $(p_1^x, p_1^y)$ to $(p_l^x, p_l^y)$
▪ Solved using dynamic programming while respecting the restrictions
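To make the dynamic-programming solution concrete, here is a minimal NumPy sketch (names are illustrative, not from the slides; it uses the squared Euclidean cost from the previous slide and enforces only the monotonicity, continuity and boundary restrictions):

```python
import numpy as np

def dtw(X, Y):
    """Dynamic Time Warping between X (d x n_x) and Y (d x n_y).

    Returns the minimal alignment cost and the warping path as a list
    of (i, j) index pairs, using squared Euclidean distance."""
    n_x, n_y = X.shape[1], Y.shape[1]
    # Pairwise squared distances between columns of X and Y
    D = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)

    # Accumulated cost matrix (dynamic programming)
    C = np.full((n_x, n_y), np.inf)
    C[0, 0] = D[0, 0]
    for i in range(n_x):
        for j in range(n_y):
            if i == 0 and j == 0:
                continue
            prev = min(
                C[i - 1, j] if i > 0 else np.inf,                 # repeat y_j
                C[i, j - 1] if j > 0 else np.inf,                 # repeat x_i
                C[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance both
            )
            C[i, j] = D[i, j] + prev

    # Backtrack from the end of the path to (0, 0)
    i, j = n_x - 1, n_y - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((C[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((C[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((C[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    return C[-1, -1], path[::-1]
```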
DTW alternative formulation

$L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \big\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \big\|_2^2$

Replication doesn’t change the objective: the warped signals can be written as $\mathbf{X}\mathbf{W}_x$ and $\mathbf{Y}\mathbf{W}_y$.

Alternative objective:

$L(\mathbf{W}_x, \mathbf{W}_y) = \big\| \mathbf{X}\mathbf{W}_x - \mathbf{Y}\mathbf{W}_y \big\|_F^2$

▪ $\mathbf{X}, \mathbf{Y}$ – original signals (same #rows, possibly different #columns)
▪ $\mathbf{W}_x, \mathbf{W}_y$ – alignment matrices
▪ Frobenius norm: $\|\mathbf{A}\|_F^2 = \sum_i \sum_j a_{i,j}^2$
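To connect the two formulations, a small sketch (an illustrative helper, not from the slides) that turns a DTW path into binary warping matrices and evaluates the Frobenius-norm objective:

```python
import numpy as np

def warping_matrices(path, n_x, n_y):
    """Build binary alignment matrices W_x (n_x x l) and W_y (n_y x l)
    from a DTW path of index pairs, so that X @ W_x and Y @ W_y are the
    replicated, time-aligned versions of X and Y."""
    l = len(path)
    W_x = np.zeros((n_x, l))
    W_y = np.zeros((n_y, l))
    for t, (i, j) in enumerate(path):
        W_x[i, t] = 1.0
        W_y[j, t] = 1.0
    return W_x, W_y

# Usage with the dtw() sketch above:
# cost, path = dtw(X, Y)
# W_x, W_y = warping_matrices(path, X.shape[1], Y.shape[1])
# frob_loss = np.sum((X @ W_x - Y @ W_y) ** 2)   # equals the DTW path cost
```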
DTW – limitations

▪ Computationally complex (especially for m sequences)
▪ Sensitive to outliers
▪ Unimodal!
Canonical Correlation Analysis reminder

maximize:   $\mathrm{tr}(\mathbf{U}^T \boldsymbol{\Sigma}_{XY} \mathbf{V})$
subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$,  $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$ for $i \neq j$

1. Linear projections maximizing correlation
2. Orthogonal projections
3. Unit variance of the projection vectors

[Figure: text $\mathbf{X}$ and image $\mathbf{Y}$ are projected by $\mathbf{U}$ and $\mathbf{V}$ into the correlated spaces $\mathbf{H}_x$ and $\mathbf{H}_y$]
Canonical Correlation Analysis reminder

▪ When the data is normalized, CCA is actually equivalent to smallest-RMSE reconstruction
▪ The CCA loss can also be re-written as:

  $L(\mathbf{U}, \mathbf{V}) = \big\| \mathbf{U}^T \mathbf{X} - \mathbf{V}^T \mathbf{Y} \big\|_F^2$

  subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$,  $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$
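As a refresher, a minimal sketch of linear CCA in this reconstruction-error view (assuming zero-mean data; the whitening + SVD recipe below is one standard way to solve it, not code from the lecture):

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Linear CCA for zero-mean data X (d_x x n) and Y (d_y x n).

    Returns projections U (d_x x k) and V (d_y x k) such that
    ||U.T @ X - V.T @ Y||_F^2 is small under the unit-variance /
    orthogonality constraints from the slide."""
    n = X.shape[1]
    Sxx = X @ X.T / n + reg * np.eye(X.shape[0])
    Syy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
    Sxy = X @ Y.T / n

    # Whiten each view, then take the top-k singular directions of the
    # whitened cross-covariance (the standard CCA solution).
    Kx = np.linalg.inv(np.linalg.cholesky(Sxx))   # Kx Sxx Kx^T = I
    Ky = np.linalg.inv(np.linalg.cholesky(Syy))
    T = Kx @ Sxy @ Ky.T
    A, s, Bt = np.linalg.svd(T)
    U = Kx.T @ A[:, :k]
    V = Ky.T @ Bt.T[:, :k]
    return U, V
```

The projected views `U.T @ X` and `V.T @ Y` can then be compared directly, which is what Canonical Time Warping exploits next.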
Canonical Time Warping

▪ Dynamic Time Warping + Canonical Correlation Analysis = Canonical Time Warping

  $L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \big\| \mathbf{U}^T \mathbf{X}\mathbf{W}_x - \mathbf{V}^T \mathbf{Y}\mathbf{W}_y \big\|_F^2$

▪ Allows aligning multimodal or multi-view data (same modality but from a different point of view)
  ▪ $\mathbf{W}_x, \mathbf{W}_y$ – temporal alignment
  ▪ $\mathbf{U}, \mathbf{V}$ – cross-modal (spatial) alignment

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]
Canonical Time Warping

  $L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \big\| \mathbf{U}^T \mathbf{X}\mathbf{W}_x - \mathbf{V}^T \mathbf{Y}\mathbf{W}_y \big\|_F^2$

▪ Optimized by coordinate descent – fix one set of parameters, optimize the other:
  ▪ $\mathbf{U}, \mathbf{V}$ – generalized eigen-decomposition
  ▪ $\mathbf{W}_x, \mathbf{W}_y$ – Gauss-Newton

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]
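A schematic of this alternation, reusing the dtw, warping_matrices and cca sketches above (an illustrative simplification: the temporal step here uses plain DTW rather than Gauss-Newton, zero-mean data is assumed, and initialization matters in practice):

```python
import numpy as np

def ctw(X, Y, k, n_iters=20):
    """Canonical Time Warping sketch: alternate between
    (1) temporal alignment W_x, W_y via DTW in the projected space, and
    (2) spatial alignment U, V via CCA on the warped (time-aligned) signals."""
    d_x, n_x = X.shape
    d_y, n_y = Y.shape
    # Start from naive (truncated-identity) spatial projections
    U = np.eye(d_x)[:, :k]
    V = np.eye(d_y)[:, :k]
    for _ in range(n_iters):
        # (1) temporal step: DTW on the projected sequences
        _, path = dtw(U.T @ X, V.T @ Y)
        W_x, W_y = warping_matrices(path, n_x, n_y)
        # (2) spatial step: CCA on the warped sequences (now equal length)
        U, V = cca(X @ W_x, Y @ W_y, k)
    loss = np.sum((U.T @ X @ W_x - V.T @ Y @ W_y) ** 2)
    return U, V, W_x, W_y, loss
```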
Generalized Time Warping

▪ Generalizes to multiple sequences, each possibly of a different modality:

  $L(\{\mathbf{U}_i\}, \{\mathbf{W}_i\}) = \sum_{i} \sum_{j} \big\| \mathbf{U}_i^T \mathbf{X}_i \mathbf{W}_i - \mathbf{U}_j^T \mathbf{X}_j \mathbf{W}_j \big\|_F^2$

▪ $\mathbf{W}_i$ – set of temporal alignments
▪ $\mathbf{U}_i$ – set of cross-modal (spatial) alignments

(1) Time warping   (2) Spatial embedding

[Generalized Canonical Time Warping, Zhou and De la Torre, 2016, TPAMI]
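The pairwise objective translates almost directly into code (a sketch; Xs, Us, Ws are illustrative lists of per-sequence data, projection, and warping matrices):

```python
import numpy as np

def gtw_loss(Xs, Us, Ws):
    """Pairwise Frobenius loss over all projected, time-aligned sequences.
    Xs[i] is d_i x n_i, Us[i] is d_i x k, Ws[i] is n_i x l (warping)."""
    Z = [U.T @ X @ W for X, U, W in zip(Xs, Us, Ws)]   # each k x l
    return sum(np.sum((Zi - Zj) ** 2) for Zi in Z for Zj in Z)
```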


Alignment examples (unimodal)
CMU Motion Capture
Subject 1: 199 frames
Subject 2: 217 frames
Subject 3: 222 frames

Weizmann

Subject 1: 40 frames
Subject 2: 44 frames
Subject 3: 43 frames

Alignment examples (multimodal)
Canonical time warping - limitations

▪ Linear transform between modalities


▪ How to address this?
Deep Canonical Time Warping

  $L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \big\| f_{\boldsymbol{\theta}_1}(\mathbf{X})\mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y})\mathbf{W}_y \big\|_F^2$

▪ Can be seen as a generalization of DCCA and GTW

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]


Deep Canonical Time Warping

  $L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \big\| f_{\boldsymbol{\theta}_1}(\mathbf{X})\mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y})\mathbf{W}_y \big\|_F^2$

▪ The projections are orthogonal (like in DCCA)
▪ Optimization is again iterative:
  ▪ Solve for the alignment ($\mathbf{W}_x, \mathbf{W}_y$) with fixed projections ($\boldsymbol{\theta}_1, \boldsymbol{\theta}_2$)
    ▪ Eigen-decomposition
  ▪ Solve for the projections ($\boldsymbol{\theta}_1, \boldsymbol{\theta}_2$) with fixed alignment ($\mathbf{W}_x, \mathbf{W}_y$)
    ▪ Gradient descent
  ▪ Repeat until convergence

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]
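A high-level sketch of one such alternation (illustrative PyTorch; f1, f2 and optimizer are assumed encoders and their optimizer, and the dtw/warping_matrices helpers from earlier are reused — the actual DCTW objective also enforces the orthogonality constraints):

```python
import torch

def dctw_step(f1, f2, X, Y, optimizer):
    """One alternation of Deep Canonical Time Warping (sketch).
    X: d_x x n_x tensor, Y: d_y x n_y tensor; f1, f2: feature encoders."""
    # 1) Alignment step: fix the networks, recompute the warping with DTW
    with torch.no_grad():
        Fx = f1(X.T).T   # k x n_x features
        Fy = f2(Y.T).T   # k x n_y features
    _, path = dtw(Fx.cpu().numpy(), Fy.cpu().numpy())
    W_x, W_y = warping_matrices(path, X.shape[1], Y.shape[1])
    W_x = torch.as_tensor(W_x, dtype=X.dtype)
    W_y = torch.as_tensor(W_y, dtype=Y.dtype)

    # 2) Projection step: fix the alignment, update the networks by
    #    gradient descent on the Frobenius objective
    loss = ((f1(X.T).T @ W_x - f2(Y.T).T @ W_y) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```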


Implicit alignment
Implicit alignment

▪ We looked at how to explicitly align temporal data
▪ Could we use that as an internal (hidden) step in our models?
▪ Can we instead encourage the model to align the data when solving a different problem?
▪ Yes!
  ▪ Graphical models
  ▪ Neural attention models (focus of today’s lecture)
Attention models
Attention in humans

▪ Foveal vision – we only see in “high resolution” in 2 degrees of vision
▪ We focus our attention selectively on certain words (for example our names)
▪ We attend to relevant speech in a noisy room
Attention models in deep learning

▪ Many examples of attention models in recent years!
▪ Why:
  ▪ Allows for implicit data alignment
  ▪ Good results empirically
  ▪ In some cases faster (no need to focus on the whole image)
  ▪ Better interpretability
Types of Attention Models

▪ Recent attention models can be roughly split into three major categories
  1. Soft attention
     ▪ Acts like a gate function. Deterministic inference.
  2. Transform networks
     ▪ Warp the input to better align it with a canonical view.
  3. Hard attention
     ▪ Includes stochastic processes. Related to reinforcement learning.
Soft attention
Machine Translation

▪ Given a sentence in one language, translate it to another

  Dog on the beach → le chien sur la plage

▪ Not exactly a multimodal task – but a good start! Each language can be seen almost as a modality.
Machine Translation with RNNs

▪ A quick reminder about encoder-decoder frameworks
▪ First we encode the sentence
▪ Then we decode it in a different language

[Figure: the encoder reads “le chien sur la plage” into a context / embedding / sentence representation, which the decoder expands into “Dog on the beach”]
Machine Translation with RNNs

▪ What is the problem with this?
▪ What happens when the sentences are very long?
▪ We expect the encoder’s hidden state to capture everything in a sentence – a very complex state in a single vector, such as:

  The agreement on the European Economic Area was signed in August 1992.

  L’accord sur la zone économique européenne a été signé en août 1992.
Decoder – attention model

▪ Before, the decoder would just take the final hidden state; now we actually care about the intermediate hidden states

[Figure: encoder hidden states $\mathbf{h}_1, \ldots, \mathbf{h}_5$ over “le chien sur la plage”; an attention module/gate combines them into a context $\mathbf{z}_0$ which, with the decoder hidden state $\mathbf{s}_0$, produces the first output word “Dog”]

[Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015]
Decoder – attention model

▪ The same attention step is repeated at every decoding time step: a new context $\mathbf{z}_1, \mathbf{z}_2, \ldots$ is computed from the encoder states $\mathbf{h}_1, \ldots, \mathbf{h}_5$ and the current decoder state $\mathbf{s}_1, \mathbf{s}_2, \ldots$ to generate each next word (“on”, “the”, …)

[Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015]
How do we encode attention

▪ Before:
  ▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z})$, where $\mathbf{z} = \mathbf{h}_T$ and $\mathbf{s}_i$ is the current state of the decoder
▪ Now:
  ▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z}_i)$
  ▪ Have an attention “gate”
  ▪ A different context $\mathbf{z}_i$ is used at each time step!
  ▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$
  ▪ $\alpha_{ij}$ – the (scalar) attention for word $j$ at generation step $i$


MT with attention

So how do we determine $\alpha_{ij}$?

▪ $\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$ – a softmax, making sure the weights sum to 1

where:
▪ $e_{ij} = \mathbf{v}^T \sigma(W \mathbf{s}_{i-1} + U \mathbf{h}_j)$ – a feed-forward network that tells us, given the current state of the decoder, how important the current encoding is now
  ▪ $\mathbf{v}, W, U$ – learnable weights

▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$ – the expectation of the context (a fancy way to say it’s a weighted average)
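A minimal NumPy sketch of this additive attention step (names are illustrative; tanh stands in for the generic nonlinearity $\sigma$):

```python
import numpy as np

def attention_context(s_prev, H, W, U, v):
    """One soft-attention step.

    s_prev : previous decoder state, shape (d_s,)
    H      : encoder hidden states, shape (T_x, d_h)
    W, U, v: learnable weights, shapes (d_a, d_s), (d_a, d_h), (d_a,)

    Returns the context vector z_i and the attention weights alpha_i."""
    # Alignment scores e_ij = v^T tanh(W s_{i-1} + U h_j) for every j
    e = np.tanh(W @ s_prev + H @ U.T) @ v           # shape (T_x,)
    # Softmax so the weights sum to 1
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context = expectation (weighted average) of the encoder states
    z = alpha @ H                                    # shape (d_h,)
    return z, alpha
```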
MT with attention

▪ Basically we are using a neural network to tell us where a neural network should be looking!
▪ We can use it with an RNN, LSTM or GRU
▪ The encoder being used has the same structure as before
  ▪ Can be uni-directional
  ▪ Can be bi-directional
▪ The model can be trained using our regular back-propagation through time – all of the modules are differentiable
Does it work?
MT with attention recap

▪ We get good translation results (especially for long sentences)
▪ We also get a (soft) alignment of sentences in different languages
▪ Extra interpretability of how the method is functioning
▪ How do we move to multimodal?
Visual captioning with soft attention

[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015]
Recap: RNN for Captioning

▪ Why might we not want to focus on the final layer?
  ▪ Looking at more fine-grained features

[Figure: caption “Bird in the sky”. A convolutional feature map with L locations and D-dimensional features feeds an RNN (states $s_0, s_1, s_2$). At each step the model outputs a distribution $a_t$ over the L locations and a word distribution $d_t$; the context $z_t$ is the expectation over the features and is combined with the previous word $y_{t-1}$ to generate the next word.]
Soft attention

▪ Allows for latent data alignment


▪ Allows us to get an idea of what the network “sees”
▪ Can be optimized using back propagation

▪ Good at paper naming!


▪ Show, Attend and Tell (extension of Show and Tell)
▪ Listen, Attend and Walk
▪ Listen, Attend and Spell
▪ Ask, Attend and Answer
Spatial Transformer Networks
Some limitations of grid-based attention

▪ Can we fixate on small parts of the image but still have easy end-to-end training?
Spatial Transformer Networks

▪ Can we make this function differentiable?
Spatial Transformer Networks

▪ Idea: a function mapping pixel coordinates $(x^t, y^t)$ of the output to pixel coordinates $(x^s, y^s)$ of the input
▪ Can we make this function differentiable?

  $\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$
Spatial Transformer Networks

▪ Idea: a function mapping pixel coordinates $(x^t, y^t)$ of the output to pixel coordinates $(x^s, y^s)$ of the input
▪ Can we make this function differentiable?

  $\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$

▪ The network “attends” to the input by predicting $\theta$
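A minimal NumPy sketch of the sampling this implies: build the output grid, map it through θ, and bilinearly sample the input (an illustration of the mechanism, not the original implementation):

```python
import numpy as np

def spatial_transform(img, theta, out_h, out_w):
    """Apply a 2x3 affine warp `theta` to a grayscale image (H x W)
    by sampling the input at the transformed output-grid coordinates."""
    H, W = img.shape
    # Normalized target coordinates in [-1, 1], one per output pixel
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])  # 3 x N
    # Source coordinates (x^s, y^s) = theta @ (x^t, y^t, 1)
    src = theta @ grid                                                  # 2 x N
    # Map from [-1, 1] back to pixel indices, clamped to the image
    sx = np.clip((src[0] + 1) * (W - 1) / 2, 0, W - 1)
    sy = np.clip((src[1] + 1) * (H - 1) / 2, 0, H - 1)
    # Bilinear sampling (differentiable w.r.t. theta in an autodiff framework)
    x0 = np.clip(np.floor(sx).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, H - 2)
    dx, dy = sx - x0, sy - y0
    out = (img[y0, x0] * (1 - dx) * (1 - dy) +
           img[y0, x0 + 1] * dx * (1 - dy) +
           img[y0 + 1, x0] * (1 - dx) * dy +
           img[y0 + 1, x0 + 1] * dx * dy)
    return out.reshape(out_h, out_w)
```

With `theta = np.array([[1.0, 0, 0], [0, 1.0, 0]])` the output reproduces the input; shrinking the diagonal terms makes the network zoom into a region, which is exactly the “attending” behaviour obtained by predicting θ.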
Examples on real-world data

▪ Results on traffic sign recognition
▪ Code available at http://torch.ch/blog/2015/09/07/spatial_transformers.html
Recap on Spatial Transformer Networks

▪ Differentiable, so we can just use back-prop for training end-to-end
▪ Can use complex models for focusing on an image
  ▪ Affine and piece-wise affine, perspective, thin plate splines
▪ Can be used to focus on certain parts of an image
▪ We can use it instead of grid-based soft and hard attention for multimodal tasks
Glimpse Network
(Hard Attention)
Hard attention

▪ Soft attention requires computing a representation for the whole image or sentence
▪ Hard attention, on the other hand, forces looking only at one part
▪ The main motivation was reduced computational cost rather than improved accuracy (although that happens a bit as well)
▪ A saccade followed by a glimpse – how the human visual system works

[Recurrent Models of Visual Attention, Mnih, 2014]
[Multiple Object Recognition with Visual Attention, Ba, 2015]
Hard attention examples

Glimpse Sensor

▪ Looking at a part of an image at different scales
▪ A number of different scales are combined into a single multichannel image (a human-retina-like representation)
▪ Given a location $l_t$, output an image summary at that location

[Recurrent Models of Visual Attention, Mnih, 2014]
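A rough sketch of such a glimpse sensor (illustrative: concentric crops at doubling scales around $l_t$, each resized to the same resolution and stacked as channels):

```python
import numpy as np

def glimpse_sensor(img, loc, base=8, n_scales=3):
    """Extract a retina-like glimpse: n_scales concentric square crops
    centred at `loc` (row, col), each twice as large as the previous,
    all downsampled to base x base and stacked as channels."""
    H, W = img.shape
    r, c = loc
    patches = []
    for s in range(n_scales):
        half = base * (2 ** s) // 2
        # Crop with clamping at the image borders
        r0, r1 = max(0, r - half), min(H, r + half)
        c0, c1 = max(0, c - half), min(W, c + half)
        crop = img[r0:r1, c0:c1]
        # Naive downsampling to base x base by strided indexing
        rows = np.linspace(0, crop.shape[0] - 1, base).astype(int)
        cols = np.linspace(0, crop.shape[1] - 1, base).astype(int)
        patches.append(crop[np.ix_(rows, cols)])
    return np.stack(patches, axis=-1)   # base x base x n_scales
```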
Glimpse network

▪ Combines the glimpse and the location of the glimpse into a joint network
▪ The glimpse is followed by a feed-forward network (a CNN or a DNN)
▪ The exact formulation of how the location and appearance are combined varies; the important thing is combining what and where
▪ Differentiable with respect to the glimpse parameters, but not the location
Overall Architecture – Emission network

▪ Given an image, a glimpse location $l_t$, and optionally an action $a_t$
▪ The action can be:
  ▪ Some action in a dynamic system – press a button, etc.
  ▪ Classification of an object
  ▪ Word output
▪ This is an RNN with two output gates and a slightly more complex input gate!
Recurrent Model of Visual Attention (RAM)

▪ Sample locations of glimpses, leading to updates in the network
▪ Use gradient descent to update the weights (the glimpse network weights are differentiable)
▪ The emission network is an RNN
▪ Not as simple as backprop, but doable
▪ Turns out this is very similar, and in some cases equivalent, to reinforcement learning using the REINFORCE learning rule [Williams, 1992]
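For intuition, a sketch of the REINFORCE-style update for the stochastic location choice (assuming a Gaussian location policy around the emitted mean; all names here are illustrative, and in practice a learned baseline is used to reduce variance):

```python
import numpy as np

def reinforce_update(mus, locs, reward, baseline, sigma=0.1, lr=1e-3):
    """REINFORCE-style update for the glimpse-location policy (sketch).

    The emission network outputs a mean mu_t; the actual glimpse location
    l_t ~ N(mu_t, sigma^2 I) is sampled. The non-differentiable sampling is
    handled by weighting grad log pi(l_t | mu_t) with (reward - baseline).

    mus, locs : lists of 2-d arrays (predicted means and sampled locations)
    reward    : scalar episode reward (e.g. 1 if the final classification is correct)
    Returns the update for each mu_t (to be backpropagated further into the
    emission network in a full implementation)."""
    advantage = reward - baseline
    # d/d mu_t of log N(l_t; mu_t, sigma^2 I) = (l_t - mu_t) / sigma^2
    return [lr * advantage * (l - m) / sigma ** 2 for m, l in zip(mus, locs)]
```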
Multimodal alignment recap
Multimodal alignment recap

▪ Explicit alignment – aligns two or more modalities (or views) as an actual task. The goal is to find correspondences between modalities
  ▪ Dynamic Time Warping
  ▪ Canonical Time Warping
  ▪ Deep Canonical Time Warping
▪ Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
  ▪ Attention models
    ▪ Soft attention
    ▪ Spatial transformer networks
    ▪ Hard attention
