
Multimodal Machine Learning

Lecture 8.1: Multimodal alignment
Louis-Philippe Morency

* Original version co-developed with Tadas Baltrusaitis

Lecture objectives

▪ Multimodal alignment
▪ Implicit
▪ Explicit
▪ Explicit signal alignment
▪ Dynamic Time Warping
▪ Canonical Time Warping
▪ Attention models in deep learning (implicit and
explicit alignment)
▪ Soft attention
▪ Hard attention
▪ Spatial Transformer Networks
Multi-modal alignment

▪ Multimodal alignment – finding relationships and correspondences between two or more modalities
▪ Examples
▪ Images with captions
▪ Recipe steps with a how-to video
▪ Phrases/words of translated sentences
▪ Two types
▪ Explicit – alignment is the task in itself
▪ Latent – alignment helps when solving a different task (for example “Attention” models)
[Figure: a “fancy algorithm” matches elements t1 … tn of Modality 1 to elements t1 … tn of Modality 2]
Explicit multimodal-alignment

▪ Explicit alignment - goal is to find correspondences between modalities
▪ Aligning speech signal to a transcript
▪ Aligning two out-of-sync sequences
▪ Co-referring expressions
Implicit multimodal-alignment

▪ Implicit alignment - uses internal latent alignment of modalities in order to better solve various problems
▪ Machine Translation
▪ Cross-modal retrieval
▪ Image & Video Captioning
▪ Visual Question Answering
Explicit alignment
Temporal sequence alignment

- Re-aligning asynchronous data
- Finding similar data across modalities (we can estimate the aligned cost)
- Event reconstruction from multiple sources
Let’s start unimodal – Dynamic Time Warping

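▪ We have two unaligned temporal unimodal signals
▪ $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{n_x}] \in \mathbb{R}^{d \times n_x}$
▪ $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{n_y}] \in \mathbb{R}^{d \times n_y}$
▪ Find set of indices to minimize the alignment difference:
$$L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \left\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \right\|_2^2$$
▪ Where $\mathbf{p}^x$ and $\mathbf{p}^y$ are index vectors of same length $l$
▪ Finding these indices is called Dynamic Time Warping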
Dynamic Time Warping continued

▪ Lowest cost path in a cost matrix, from $(p_1^x, p_1^y)$ through $(p_t^x, p_t^y)$ to $(p_l^x, p_l^y)$
▪ Restrictions
▪ Monotonicity – no going back in time
▪ Continuity - no gaps
▪ Boundary conditions - start and end at the same points
▪ Warping window - don’t get too far from diagonal
▪ Slope constraint – do not insert or skip too much
Dynamic Time Warping continued

▪ Lowest cost path in a cost matrix, from $(p_1^x, p_1^y)$ through $(p_t^x, p_t^y)$ to $(p_l^x, p_l^y)$
▪ Solved using dynamic programming while respecting the restrictions
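
The dynamic-programming solution can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it enforces only the monotonicity, continuity and boundary restrictions (no warping window or slope constraint), and the function name `dtw` is a choice made for this example.

```python
import numpy as np

def dtw(X, Y):
    """Align the columns of X (d x nx) and Y (d x ny); return (cost, px, py)."""
    nx, ny = X.shape[1], Y.shape[1]
    # Pairwise squared Euclidean distances between all frames: the cost matrix.
    D = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    # C[i, j] = lowest cost of aligning the first i frames of X with the first j of Y.
    C = np.full((nx + 1, ny + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            C[i, j] = D[i - 1, j - 1] + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    # Backtrack from the last pair of frames to the first pair.
    i, j = nx, ny
    px, py = [i - 1], [j - 1]
    while (i, j) != (1, 1):
        _, i, j = min((C[i - 1, j - 1], i - 1, j - 1),
                      (C[i - 1, j], i - 1, j),
                      (C[i, j - 1], i, j - 1))
        px.append(i - 1)
        py.append(j - 1)
    return C[nx, ny], np.array(px[::-1]), np.array(py[::-1])

# Toy 1-D signals: Y repeats one frame of X.
X = np.array([[0.0, 1.0, 2.0, 3.0]])
Y = np.array([[0.0, 1.0, 1.0, 2.0, 3.0]])
cost, px, py = dtw(X, Y)
```

Here `px` and `py` are the index vectors $\mathbf{p}^x$ and $\mathbf{p}^y$ (0-based), and the double loop makes the $O(n_x n_y)$ cost of the method visible.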
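DTW alternative formulation

$$L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \left\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \right\|_2^2$$
Replication doesn’t change the objective!
▪ Replicated frames of $\mathbf{X}$ $= \mathbf{X}\mathbf{W}_x$
▪ Replicated frames of $\mathbf{Y}$ $= \mathbf{Y}\mathbf{W}_y$

Alternative objective:
$$L(\mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{X}\mathbf{W}_x - \mathbf{Y}\mathbf{W}_y \right\|_F^2$$
▪ $\mathbf{X}, \mathbf{Y}$ – original signals (same #rows, possibly different #columns)
▪ $\mathbf{W}_x, \mathbf{W}_y$ - alignment matrices
▪ Frobenius norm: $\|\mathbf{A}\|_F^2 = \sum_i \sum_j |a_{i,j}|^2$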
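
To see that the two formulations agree, one can build the binary alignment matrices from the index vectors and compare both objectives. The sketch below assumes 0-based index vectors; `warp_matrix` is a hypothetical helper name introduced for this example.

```python
import numpy as np

def warp_matrix(p, n):
    """Binary n x l matrix whose t-th column selects (replicates) frame p[t]."""
    W = np.zeros((n, len(p)))
    W[p, np.arange(len(p))] = 1.0
    return W

X = np.array([[0.0, 1.0, 2.0, 3.0]])        # d = 1, n_x = 4
Y = np.array([[0.0, 1.5, 1.0, 2.0, 3.5]])   # d = 1, n_y = 5
px = np.array([0, 1, 1, 2, 3])              # index vectors of the same length l = 5
py = np.array([0, 1, 2, 3, 4])

Wx, Wy = warp_matrix(px, X.shape[1]), warp_matrix(py, Y.shape[1])

# Sum-over-path form of the objective.
sum_form = sum(np.sum((X[:, i] - Y[:, j]) ** 2) for i, j in zip(px, py))
# Matrix form: squared Frobenius norm of the difference of the warped signals.
frob_form = np.linalg.norm(X @ Wx - Y @ Wy, "fro") ** 2
```

Both `sum_form` and `frob_form` evaluate the same alignment cost; `X @ Wx` is simply `X` with its second frame repeated.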

DTW - limitations

▪ Computationally complex

m sequences

▪ Sensitive to outliers

▪ Unimodal!
Canonical Correlation Analysis reminder
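maximize: $\mathrm{tr}(\mathbf{U}^T \boldsymbol{\Sigma}_{XY} \mathbf{V})$

subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$, $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$ for $i \neq j$

1. Linear projections maximizing correlation
2. Orthogonal projections
3. Unit variance of the projection vectors

[Figure: Text → $\mathbf{U}$ → $\mathbf{H}_x$ and Image → $\mathbf{V}$ → $\mathbf{H}_y$; scatter plot of projection of X against projection of Y]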
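
One common way to solve this constrained problem is to whiten each view and take an SVD of the whitened cross-covariance. The sketch below follows that route; the small ridge term `reg`, the function names and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

def cca(X, Y, k, reg=1e-8):
    """X: (dx, n), Y: (dy, n). Returns projections U, V and the top-k correlations."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    n = X.shape[1]
    Sxx = X @ X.T / (n - 1) + reg * np.eye(X.shape[0])
    Syy = Y @ Y.T / (n - 1) + reg * np.eye(Y.shape[0])
    Sxy = X @ Y.T / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular vectors of the whitened cross-covariance give the projections.
    A, s, Bt = np.linalg.svd(Kx @ Sxy @ Ky)
    return Kx @ A[:, :k], Ky @ Bt.T[:, :k], s[:k]

# Synthetic two-view data sharing one latent signal z.
rng = np.random.default_rng(0)
z = rng.standard_normal((1, 500))
X = np.vstack([z + 0.1 * rng.standard_normal((1, 500)), rng.standard_normal((2, 500))])
Y = np.vstack([rng.standard_normal((1, 500)), -z + 0.1 * rng.standard_normal((1, 500))])
U, V, corr = cca(X, Y, k=2)
```

By construction `U.T @ Sxx @ U` and `V.T @ Syy @ V` are identity matrices (unit variance), and `U.T @ Sxy @ V` is diagonal with the canonical correlations `corr` on its diagonal.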

Canonical Correlation Analysis reminder

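▪ When data is normalized it is actually equivalent to smallest RMSE
▪ CCA loss can also be re-written as:
$$L(\mathbf{U}, \mathbf{V}) = \left\| \mathbf{U}^T \mathbf{X} - \mathbf{V}^T \mathbf{Y} \right\|_F^2$$
subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$, $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$ for $i \neq j$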
Canonical Time Warping

▪ Dynamic Time Warping + Canonical Correlation Analysis = Canonical Time Warping
$$L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{U}^T \mathbf{X}\mathbf{W}_x - \mathbf{V}^T \mathbf{Y}\mathbf{W}_y \right\|_F^2$$
▪ Allows aligning multi-modal or multi-view (same modality but from a different point of view) sequences
▪ $\mathbf{W}_x, \mathbf{W}_y$ – temporal alignment
▪ $\mathbf{U}, \mathbf{V}$ – cross-modal (spatial) alignment

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009]
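Canonical Time Warping

$$L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{U}^T \mathbf{X}\mathbf{W}_x - \mathbf{V}^T \mathbf{Y}\mathbf{W}_y \right\|_F^2$$

Optimized by coordinate descent – fix one set of parameters, optimize another
▪ $\mathbf{W}_x, \mathbf{W}_y$ (with $\mathbf{U}, \mathbf{V}$ fixed): dynamic time warping
▪ $\mathbf{U}, \mathbf{V}$ (with $\mathbf{W}_x, \mathbf{W}_y$ fixed): generalized eigen-decomposition

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]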
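
The alternation can be sketched as follows. This is a simplified illustration and not the authors' implementation: the spatial step uses a whitening-plus-SVD CCA solver (an assumed stand-in for the generalized eigen-decomposition), initialisation is a truncated identity, the number of iterations is fixed, and all function names and test signals are hypothetical.

```python
import numpy as np

def dtw_path(A, B):
    """DTW between the columns of A and B; returns 0-based index vectors px, py."""
    n, m = A.shape[1], B.shape[1]
    D = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i, j] = D[i - 1, j - 1] + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    i, j, path = n, m, [(n - 1, m - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((C[i - 1, j - 1], i - 1, j - 1),
                      (C[i - 1, j], i - 1, j),
                      (C[i, j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    px, py = map(np.array, zip(*path[::-1]))
    return px, py

def cca(A, B, k, reg=1e-6):
    """CCA projections for paired columns of A and B (whitening + SVD)."""
    A = A - A.mean(axis=1, keepdims=True)
    B = B - B.mean(axis=1, keepdims=True)
    n = A.shape[1]
    def inv_sqrt(S):
        w, Q = np.linalg.eigh(S + reg * np.eye(len(S)))
        return Q @ np.diag(w ** -0.5) @ Q.T
    Ka, Kb = inv_sqrt(A @ A.T / (n - 1)), inv_sqrt(B @ B.T / (n - 1))
    P, _, Qt = np.linalg.svd(Ka @ (A @ B.T / (n - 1)) @ Kb)
    return Ka @ P[:, :k], Kb @ Qt.T[:, :k]

def ctw(X, Y, k, iters=5):
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    U, V = np.eye(X.shape[0])[:, :k], np.eye(Y.shape[0])[:, :k]
    for _ in range(iters):
        px, py = dtw_path(U.T @ X, V.T @ Y)   # temporal step: U, V fixed
        U, V = cca(X[:, px], Y[:, py], k)     # spatial step: X[:, px] equals X @ W_x
    px, py = dtw_path(U.T @ X, V.T @ Y)
    return U, V, px, py

t = np.linspace(0, 2 * np.pi, 40)
X = np.vstack([np.sin(t), np.cos(t)])                    # view 1: 2-D, 40 frames
t_warp = 2 * np.pi * np.linspace(0, 1, 55) ** 1.5        # same motion at a different pace
A = np.array([[1.0, 0.5], [0.0, 2.0], [1.0, -1.0]])      # hypothetical map to a 3-D view
Y = A @ np.vstack([np.sin(t_warp), np.cos(t_warp)])
U, V, px, py = ctw(X, Y, k=2)
```

Indexing with `px` and `py` plays the role of multiplying by $\mathbf{W}_x$ and $\mathbf{W}_y$, which avoids building the binary matrices explicitly.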
Generalized Time warping

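▪ Generalize to multiple sequences all of different modality
$$L(\{\mathbf{U}_i\}, \{\mathbf{W}_i\}) = \sum_{i=1}^{m} \sum_{j=1}^{m} \left\| \mathbf{U}_i^T \mathbf{X}_i \mathbf{W}_i - \mathbf{U}_j^T \mathbf{X}_j \mathbf{W}_j \right\|_F^2$$
▪ $m$ – number of sequences
▪ $\mathbf{W}_i$ – set of temporal alignments
▪ $\mathbf{U}_i$ – set of cross-modal (spatial) alignments

(1) Time warping
(2) Spatial embedding

[Generalized Canonical Time Warping, Zhou and De la Torre, 2016, TPAMI]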

Alignment examples (unimodal)
CMU Motion Capture
Subject 1: 199 frames
Subject 2: 217 frames
Subject 3: 222 frames


Subject 1: 40 frames
Subject 2: 44 frames
Subject 3: 43 frames

Alignment examples (multimodal)
Canonical time warping - limitations

▪ Linear transform between modalities

▪ How to address this?
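Deep Canonical Time Warping

$$L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \left\| f_{\boldsymbol{\theta}_1}(\mathbf{X})\mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y})\mathbf{W}_y \right\|_F^2$$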

▪ Could be seen as generalization of DCCA and GTW

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]

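Deep Canonical Time Warping

$$L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \left\| f_{\boldsymbol{\theta}_1}(\mathbf{X})\mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y})\mathbf{W}_y \right\|_F^2$$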

▪ The projections are orthogonal (like in DCCA)

▪ Optimization is again iterative:
▪ Solve for alignment (𝑾𝒙 , 𝑾𝒚 ) with fixed projections (𝜽1 , 𝜽2 )
▪ Eigen decomposition
▪ Solve for projections (𝜽1 , 𝜽2 ) with fixed alignment (𝑾𝒙 , 𝑾𝒚 )
▪ Gradient descent
▪ Repeat till convergence

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]

Implicit alignment
Implicit alignment

▪ We looked at how to explicitly align temporal data
▪ Could we use that as an internal (hidden) step in our models?
▪ Can we instead encourage the model to align
data when solving a different problem?
▪ Yes!
▪ Graphical models
▪ Neural attention models (focus of today’s lecture)
Attention models
Attention in humans

▪ Foveal vision – we only see in “high resolution” in 2 degrees of the visual field

▪ We focus our attention selectively on certain words (for example our names)
▪ We attend to relevant speech in a noisy room
Attention models in deep learning

▪ Many examples of attention models in recent years!

▪ Why:
▪ Allows for implicit data alignment
▪ Good results empirically
▪ In some cases faster (don’t need to focus on all the image)
▪ Better Interpretability
Types of Attention Models

▪ Recent attention models can be roughly split into three major categories
1. Soft attention
▪ Acts like a gate function. Deterministic inference.
2. Transform network
▪ Warp the input to better align with canonical view
3. Hard attention
▪ Includes stochastic processes. Related to reinforcement learning
Soft attention
Machine Translation

▪ Given a sentence in one language translate it to another
▪ Example: “Dog on the beach” ↔ “le chien sur la plage”
▪ Not exactly a multimodal task – but a good start! Each language can be seen almost as a modality.
Machine Translation with RNNs

▪ A quick reminder about encoder-decoder frameworks
▪ First we encode the sentence
▪ Then we decode it in a different language

[Figure: the encoder reads “le chien sur la plage” into a context / embedding / sentence representation, from which the decoder produces “Dog on the beach”]
Machine Translation with RNNs

▪ What is the problem with this?
▪ What happens when the sentences are very long?
▪ We expect the encoder’s hidden state to capture everything in a sentence, a very complex state in a single vector, such as

The agreement on the European Economic Area was signed in August 1992.

L’accord sur la zone économique européenne a été signé en août 1992.
Decoder – attention model

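▪ Before, the decoder would just take the encoder’s final hidden state; now we actually care about the intermediate hidden states

[Figure: the encoder reads “le chien sur la plage” and produces hidden states $\mathbf{h}_1, \ldots, \mathbf{h}_5$; an attention module / gate combines them with the decoder hidden state $\mathbf{s}_0$ into the context $\mathbf{z}_0$]

[Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015]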
Decoder – attention model

[Figure, next decoding step: the attention module uses the decoder hidden state $\mathbf{s}_1$ to form the context $\mathbf{z}_1$ from $\mathbf{h}_1, \ldots, \mathbf{h}_5$; words shown at the output: “Dog on”]
Decoder – attention model

[Figure, following decoding step: the attention module uses the decoder hidden state $\mathbf{s}_2$ to form the context $\mathbf{z}_2$ from $\mathbf{h}_1, \ldots, \mathbf{h}_5$; words shown at the output: “on the”]
How do we encode attention

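▪ Before:
▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z})$, where $\mathbf{z} = \mathbf{h}_T$, and $\mathbf{s}_i$ is the current state of the decoder
▪ Now:
▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z}_i)$
▪ Have an attention “gate”
▪ A different context $\mathbf{z}_i$ used at each time step!
▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$

$\alpha_{ij}$ - the (scalar) attention for word $j$ at generation step $i$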

MT with attention

So how do we determine $\alpha_{ij}$?

▪ $\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$ - softmax, making sure they sum to 1

▪ $e_{ij} = \mathbf{v}^T \sigma(W \mathbf{s}_{i-1} + U \mathbf{h}_j)$
a feedforward network that can tell us, given the current state of the decoder, how important the current encoding is now
$\mathbf{v}, W, U$ – learnable weights

▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$ - expectation of the context (a fancy way to say it’s a weighted average)
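
One decoding step of this attention computation can be sketched in NumPy as below. The sizes and random weights are placeholders, and $\sigma$ is taken to be `tanh` (the nonlinearity used by Bahdanau et al.); any other choice of nonlinearity would be a different assumption.

```python
import numpy as np

def soft_attention(s_prev, H, W, U, v):
    """s_prev: (ds,) decoder state; H: (T, dh) encoder states.
    Returns the context z_i (dh,) and the attention weights alpha_i (T,)."""
    e = np.tanh(s_prev @ W.T + H @ U.T) @ v   # scores e_ij, one per encoder state
    e = e - e.max()                           # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()       # softmax over encoder positions
    z = alpha @ H                             # weighted average of encoder states
    return z, alpha

rng = np.random.default_rng(0)
T, dh, ds, da = 5, 4, 3, 6                    # hypothetical sizes
H = rng.standard_normal((T, dh))              # h_1 ... h_5
s_prev = rng.standard_normal(ds)              # s_{i-1}
W = rng.standard_normal((da, ds))
U = rng.standard_normal((da, dh))
v = rng.standard_normal(da)
z, alpha = soft_attention(s_prev, H, W, U, v)
```

Every operation here is differentiable, which is why the whole model can be trained with ordinary back-propagation; `alpha` is the soft alignment that gets visualised.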
MT with attention

Basically we are using a neural network to tell us where a neural network should be looking!
▪ We can use it with RNN, LSTM or GRU
▪ Encoder being used is the same structure as before
▪ Can use uni-directional
▪ Can use bi-directional
▪ Model can be trained using our regular back-propagation through time, all of the modules are differentiable
Does it work?
MT with attention recap

▪ Get good translation results (especially for long sentences)
▪ Also get a (soft) alignment of sentences in different languages
▪ Extra interpretability of method functioning
▪ How do we move to multimodal?
Visual captioning with soft attention

[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015]
Recap RNN for Captioning

Bird in the sky

Why might we not want to focus on the final layer?

Looking at more fine grained features

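[Figure: recurrent captioning with attention. Recoverable labels: $a_1, a_2, a_3$ (over $L$ locations), $d_1, d_2$, hidden states $s_0, s_1, s_2$, contexts $z_1, z_2$ (expectation over features of dimension $D$), inputs $y_0, y_1$ (first word)]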

Soft attention

▪ Allows for latent data alignment

▪ Allows us to get an idea of what the network “sees”
▪ Can be optimized using back propagation

▪ Good at paper naming!

▪ Show, Attend and Tell (extension of Show and Tell)
▪ Listen, Attend and Walk
▪ Listen, Attend and Spell
▪ Ask, Attend and Answer
Spatial Transformer
Some limitations of grid based attention

▪ Can we fixate on small parts of image but still have easy end-to-end training?
Spatial Transformer Networks

Can we make this function differentiable?

Spatial Transformer Networks

Idea: Function mapping pixel coordinates $(x^t, y^t)$ of output to pixel coordinates $(x^s, y^s)$ of input

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

Can we make this function differentiable?

Network “attends” to input by predicting $\theta$
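
The coordinate mapping and the sampling step can be sketched in NumPy as below. This assumes normalised coordinates in $[-1, 1]$, a single-channel image and bilinear interpolation with border clamping; the function names are illustrative, and in a real model $\theta$ would come from a small localisation network rather than being written by hand.

```python
import numpy as np

def affine_grid(theta, H, W):
    """theta: (2, 3). Returns source coordinates xs, ys (each H x W) in [-1, 1] units."""
    yt, xt = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    target = np.stack([xt, yt, np.ones_like(xt)]).reshape(3, -1)   # rows: x_t, y_t, 1
    xs, ys = (theta @ target).reshape(2, H, W)
    return xs, ys

def bilinear_sample(img, xs, ys):
    """Sample img (H x W) at normalised source coordinates with bilinear weights."""
    H, W = img.shape
    x = (xs + 1) * (W - 1) / 2                 # normalised -> pixel coordinates
    y = (ys + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx = np.clip(x - x0, 0, 1)                 # clamping replicates the border
    wy = np.clip(y - y0, 0, 1)
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x0 + 1]
    bottom = (1 - wx) * img[y0 + 1, x0] + wx * img[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bottom

img = np.arange(25, dtype=float).reshape(5, 5)
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
zoom = np.array([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]])    # attend to the central region
same = bilinear_sample(img, *affine_grid(identity, 5, 5))
crop = bilinear_sample(img, *affine_grid(zoom, 5, 5))
```

Because the output is a weighted sum of pixel values whose weights vary smoothly with the sampled coordinates, gradients can flow back to $\theta$; that is what makes this kind of cropping trainable end-to-end.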


Examples on real world data

▪ Results on traffic sign recognition

Code available http://torch.ch/blog/2015/09/07/spatial_transformers.html

Recap on Spatial Transformer Networks

▪ Differentiable so we can just use back-prop for training end-to-end

▪ Can use complex models for focusing on an image
▪ Affine and Piece-Wise Affine, Perspective, Thin Plate Splines
▪ Can use to focus on certain parts of an image
▪ We can use it instead of grid based soft and hard attention for multi-modal tasks
Glimpse Network
(Hard Attention)
Hard attention

▪ Soft attention requires computing a representation for the whole image or sentence
▪ Hard attention on the other hand forces looking only at one part
▪ Main motivation was reduced computational cost rather than improved accuracy (although that happens a bit as well)
▪ Saccade followed by a glimpse – how the human visual system works

[Recurrent Models of Visual Attention, Mnih, 2014]
[Multiple Object Recognition with Visual Attention, Ba, 2015]
Hard attention examples
Glimpse Sensor

▪ Looking at a part of an image at different scales
▪ At a number of different scales combined to a single multichannel image (human retina like representation)
▪ Given a location $l_t$ output an image summary at that location
[Recurrent Models of Visual Attention, Mnih, 2014]
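
A glimpse sensor of this kind can be sketched as follows. This is an illustrative version only: it assumes a single-channel image, an even patch size, zero padding outside the image, a doubling of the patch width at each scale, and average pooling back to a common resolution; the name `glimpse` and the default parameters are choices made for the example.

```python
import numpy as np

def glimpse(img, loc, size=8, n_scales=3):
    """img: (H, W); loc: (row, col) centre in pixels; size must be even.
    Returns an (n_scales, size, size) multichannel summary around loc."""
    pad = size * 2 ** (n_scales - 1) // 2          # half-width of the largest patch
    padded = np.pad(img, pad, mode="constant")     # zeros outside the image
    r, c = loc[0] + pad, loc[1] + pad
    channels = []
    for k in range(n_scales):
        f = 2 ** k                                 # downsampling factor at this scale
        half = size * f // 2
        patch = padded[r - half:r + half, c - half:c + half]
        # Average-pool by a factor f so every scale ends up size x size.
        channels.append(patch.reshape(size, f, size, f).mean(axis=(1, 3)))
    return np.stack(channels)

img = np.ones((32, 32))
centre = glimpse(img, (16, 16))    # fine detail near the location, coarse context further out
corner = glimpse(img, (0, 0))      # patches partly fall outside the image
```

The slicing by an integer location is the non-differentiable part: the output changes in jumps as `loc` moves, which is why the location cannot be trained by plain back-propagation.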
Glimpse network

▪ Combining the Glimpse and the location of the glimpse into a joint network

▪ The glimpse is followed by a feedforward network (CNN or a DNN)

▪ The exact formulation of how the location and appearance are combined
varies, the important thing is combining what and where
▪ Differentiable with respect to glimpse parameters but not the location
Overall Architecture - Emission network

▪ Given an image, outputs a glimpse location $l_t$ and optionally an action $a_t$
▪ Action can be:
▪ Some action in a dynamic system – press a button etc.
▪ Classification of an object
▪ Word output
▪ This is an RNN with two output gates and a slightly more complex input gate!
Recurrent model of Visual Attention (RAM)

▪ Sample locations of glimpses leading to updates in the network
▪ Use gradient descent to update the weights (the glimpse network weights are differentiable)
▪ The emission network is an RNN
▪ Not as simple as backprop but
▪ Turns out this is very similar and in some cases equivalent to reinforcement learning using the REINFORCE learning rule [Williams, 1992]
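
The REINFORCE rule itself can be illustrated on a toy problem that is far simpler than RAM. In the sketch below the "policy" is just a softmax over five candidate glimpse locations, and a reward of 1 is given only when a made-up target location is sampled; the task, learning rate and running-average baseline are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_locations, target = 5, 3          # hypothetical task: only location 3 is rewarded
logits = np.zeros(n_locations)      # parameters of the location policy
baseline, lr = 0.0, 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    p = softmax(logits)
    loc = rng.choice(n_locations, p=p)          # sample a location (non-differentiable)
    reward = 1.0 if loc == target else 0.0
    grad_logp = -p                              # d log p(loc) / d logits = onehot(loc) - p
    grad_logp[loc] += 1.0
    # REINFORCE: move along grad log p(loc), scaled by (reward - baseline).
    logits += lr * (reward - baseline) * grad_logp
    baseline += 0.05 * (reward - baseline)      # running average reduces variance

probs = softmax(logits)
```

No gradient is ever taken through the sampling step; the update only needs the gradient of the log-probability of the sampled location, weighted by how much better than average the outcome was.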
Multi-modal alignment
Multimodal-alignment recap

▪ Explicit alignment - aligns two or more modalities (or views) as an actual task. The goal is to find correspondences between modalities
▪ Dynamic Time Warping
▪ Canonical Time Warping
▪ Deep Canonical Time Warping
▪ Implicit alignment - uses internal latent alignment of
modalities in order to better solve various problems
▪ Attention models
▪ Soft attention
▪ Spatial transformer networks
▪ Hard attention

