
Advanced Multimodal Machine Learning


Lecture 8.1: Multimodal alignment
Louis-Philippe Morency

* Original version co-developed with Tadas Baltrusaitis

Upcoming Schedule

▪ First project assignment:


▪ Proposal presentation (10/3 and 10/5)
▪ First project report (10/8)
▪ Midterm project assignment
▪ Midterm presentations (Tuesday 11/6 & Thursday
11/8)
▪ Midterm report (Sunday 11/11) – No extensions
▪ Final project assignment
▪ Final presentation (TBD)
▪ Final report (12/11 at 11:59pm ET)
Midterm Presentation Instructions

▪ 7-8 minute presentations (max: 8 mins)


▪ +1.5 minutes for written feedback and notes
▪ All team members should be involved.
▪ The ordering of the presentations (Tuesday vs. Thursday) is the inverse of the proposal order.
▪ The presentations will be from 4:30pm – 6pm
▪ Please arrive on time!
Midterm Presentation Instructions

▪ General definition of your research problem, including a mathematical formalization of the problem. Include definitions of the main variables and the overall objective function (2-3 slides)
▪ Explain at least two multimodal baseline models for your research problem (2-4 slides)
▪ Present current results of these baseline models on your dataset. You should study the failure cases of the baseline models (3-5 slides)
▪ Describe the research directions you are planning to explore. Discuss how they will address some of the shortcomings of your baseline models. (2-3 slides)
Midterm Project Report Instructions

▪ Main sections:
▪ Abstract
▪ Introduction
▪ Related work
▪ Problem statement
▪ Multimodal baseline models
▪ Experimental methodology
▪ Results and discussion
▪ Proposed approaches
Lecture objectives

▪ Multimodal alignment
▪ Implicit
▪ Explicit
▪ Explicit signal alignment
▪ Dynamic Time Warping
▪ Canonical Time Warping
▪ Attention models in deep learning (implicit and
explicit alignment)
▪ Soft attention
▪ Hard attention
▪ Spatial Transformer Networks
Multi-modal alignment
Multimodal alignment

▪ Multimodal alignment – finding relationships and correspondences between two or more modalities
▪ Examples
  ▪ Images with captions
  ▪ Recipe steps with a how-to video
  ▪ Phrases/words of translated sentences
▪ Two types
  ▪ Explicit – alignment is the task in itself
  ▪ Latent – alignment helps when solving a different task (for example “Attention” models)

[Figure: elements t1 … tn of Modality 1 matched by a “fancy algorithm” to elements of Modality 2]
Explicit multimodal alignment

▪ Explicit alignment – the goal is to find correspondences between modalities
  ▪ Aligning a speech signal to a transcript
  ▪ Aligning two out-of-sync sequences
  ▪ Co-referring expressions
Implicit multimodal alignment

▪ Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
  ▪ Machine Translation
  ▪ Cross-modal retrieval
  ▪ Image & Video Captioning
  ▪ Visual Question Answering
Explicit alignment
Temporal sequence alignment

Applications:
- Re-aligning asynchronous data
- Finding similar data across modalities (we can estimate the alignment cost)
- Event reconstruction from multiple sources
Let’s start unimodal – Dynamic Time Warping

▪ We have two unaligned temporal unimodal signals
  ▪ $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{n_x}] \in \mathbb{R}^{d \times n_x}$
  ▪ $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{n_y}] \in \mathbb{R}^{d \times n_y}$
▪ Find a set of indices that minimizes the alignment difference:

  $L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \big\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \big\|_2^2$

▪ where $\mathbf{p}^x$ and $\mathbf{p}^y$ are index vectors of the same length $l$
▪ Finding these indices is called Dynamic Time Warping
Dynamic Time Warping continued

▪ Lowest-cost path in a cost matrix, from $(p_1^x, p_1^y)$ to $(p_l^x, p_l^y)$
▪ Restrictions
  ▪ Monotonicity – no going back in time
  ▪ Continuity – no gaps
  ▪ Boundary conditions – start and end at the same points
  ▪ Warping window – don’t get too far from the diagonal
  ▪ Slope constraint – do not insert or skip too much
Dynamic Time Warping continued

▪ Lowest-cost path in a cost matrix, from $(p_1^x, p_1^y)$ to $(p_l^x, p_l^y)$
▪ Solved using dynamic programming while respecting the restrictions
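To make the dynamic-programming solution concrete, here is a minimal NumPy sketch (names are illustrative, not from the slides; it uses the squared Euclidean cost from the previous slide and enforces only the monotonicity, continuity and boundary restrictions):

```python
import numpy as np

def dtw(X, Y):
    """Dynamic Time Warping between X (d x n_x) and Y (d x n_y).

    Returns the minimal alignment cost and the warping path as a list
    of (i, j) index pairs, using squared Euclidean distance."""
    n_x, n_y = X.shape[1], Y.shape[1]
    # Pairwise squared distances between columns of X and Y
    D = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)

    # Accumulated cost matrix (dynamic programming)
    C = np.full((n_x, n_y), np.inf)
    C[0, 0] = D[0, 0]
    for i in range(n_x):
        for j in range(n_y):
            if i == 0 and j == 0:
                continue
            prev = min(
                C[i - 1, j] if i > 0 else np.inf,                 # repeat y_j
                C[i, j - 1] if j > 0 else np.inf,                 # repeat x_i
                C[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance both
            )
            C[i, j] = D[i, j] + prev

    # Backtrack from the end of the path to (0, 0)
    i, j = n_x - 1, n_y - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((C[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((C[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((C[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    return C[-1, -1], path[::-1]
```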
DTW alternative formulation

$L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \big\| \mathbf{x}_{p_t^x} - \mathbf{y}_{p_t^y} \big\|_2^2$

Replication doesn’t change the objective: the warped signals can be written as $\mathbf{X}\mathbf{W}_x$ and $\mathbf{Y}\mathbf{W}_y$.

Alternative objective:

$L(\mathbf{W}_x, \mathbf{W}_y) = \big\| \mathbf{X}\mathbf{W}_x - \mathbf{Y}\mathbf{W}_y \big\|_F^2$

▪ $\mathbf{X}, \mathbf{Y}$ – original signals (same #rows, possibly different #columns)
▪ $\mathbf{W}_x, \mathbf{W}_y$ – alignment matrices
▪ Frobenius norm: $\|\mathbf{A}\|_F^2 = \sum_i \sum_j a_{i,j}^2$
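To connect the two formulations, a small sketch (an illustrative helper, not from the slides) that turns a DTW path into binary warping matrices and evaluates the Frobenius-norm objective:

```python
import numpy as np

def warping_matrices(path, n_x, n_y):
    """Build binary alignment matrices W_x (n_x x l) and W_y (n_y x l)
    from a DTW path of index pairs, so that X @ W_x and Y @ W_y are the
    replicated, time-aligned versions of X and Y."""
    l = len(path)
    W_x = np.zeros((n_x, l))
    W_y = np.zeros((n_y, l))
    for t, (i, j) in enumerate(path):
        W_x[i, t] = 1.0
        W_y[j, t] = 1.0
    return W_x, W_y

# Usage with the dtw() sketch above:
# cost, path = dtw(X, Y)
# W_x, W_y = warping_matrices(path, X.shape[1], Y.shape[1])
# frob_loss = np.sum((X @ W_x - Y @ W_y) ** 2)   # equals the DTW path cost
```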
DTW – limitations

▪ Computationally complex (especially for m sequences)
▪ Sensitive to outliers
▪ Unimodal!
Canonical Correlation Analysis reminder

maximize:   $\mathrm{tr}(\mathbf{U}^T \boldsymbol{\Sigma}_{XY} \mathbf{V})$
subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$,  $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$ for $i \neq j$

1. Linear projections maximizing correlation
2. Orthogonal projections
3. Unit variance of the projection vectors

[Figure: text $\mathbf{X}$ and image $\mathbf{Y}$ are projected by $\mathbf{U}$ and $\mathbf{V}$ into the correlated spaces $\mathbf{H}_x$ and $\mathbf{H}_y$]
Canonical Correlation Analysis reminder

▪ When the data is normalized, CCA is actually equivalent to smallest-RMSE reconstruction
▪ The CCA loss can also be re-written as:

  $L(\mathbf{U}, \mathbf{V}) = \big\| \mathbf{U}^T \mathbf{X} - \mathbf{V}^T \mathbf{Y} \big\|_F^2$

  subject to: $\mathbf{U}^T \boldsymbol{\Sigma}_{XX} \mathbf{U} = \mathbf{V}^T \boldsymbol{\Sigma}_{YY} \mathbf{V} = \mathbf{I}$,  $\mathbf{u}_{(j)}^T \boldsymbol{\Sigma}_{XY} \mathbf{v}_{(i)} = 0$
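As a refresher, a minimal sketch of linear CCA in this reconstruction-error view (assuming zero-mean data; the whitening + SVD recipe below is one standard way to solve it, not code from the lecture):

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Linear CCA for zero-mean data X (d_x x n) and Y (d_y x n).

    Returns projections U (d_x x k) and V (d_y x k) such that
    ||U.T @ X - V.T @ Y||_F^2 is small under the unit-variance /
    orthogonality constraints from the slide."""
    n = X.shape[1]
    Sxx = X @ X.T / n + reg * np.eye(X.shape[0])
    Syy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
    Sxy = X @ Y.T / n

    # Whiten each view, then take the top-k singular directions of the
    # whitened cross-covariance (the standard CCA solution).
    Kx = np.linalg.inv(np.linalg.cholesky(Sxx))   # Kx Sxx Kx^T = I
    Ky = np.linalg.inv(np.linalg.cholesky(Syy))
    T = Kx @ Sxy @ Ky.T
    A, s, Bt = np.linalg.svd(T)
    U = Kx.T @ A[:, :k]
    V = Ky.T @ Bt.T[:, :k]
    return U, V
```

The projected views `U.T @ X` and `V.T @ Y` can then be compared directly, which is what Canonical Time Warping exploits next.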
Canonical Time Warping

▪ Dynamic Time Warping + Canonical Correlation Analysis = Canonical Time Warping

  $L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \big\| \mathbf{U}^T \mathbf{X}\mathbf{W}_x - \mathbf{V}^T \mathbf{Y}\mathbf{W}_y \big\|_F^2$

▪ Allows aligning multimodal or multi-view data (same modality but from a different point of view)
  ▪ $\mathbf{W}_x, \mathbf{W}_y$ – temporal alignment
  ▪ $\mathbf{U}, \mathbf{V}$ – cross-modal (spatial) alignment

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]
Canonical Time Warping

  $L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \big\| \mathbf{U}^T \mathbf{X}\mathbf{W}_x - \mathbf{V}^T \mathbf{Y}\mathbf{W}_y \big\|_F^2$

▪ Optimized by coordinate descent – fix one set of parameters, optimize the other:
  ▪ $\mathbf{U}, \mathbf{V}$ – generalized eigen-decomposition
  ▪ $\mathbf{W}_x, \mathbf{W}_y$ – Gauss-Newton

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]
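A schematic of this alternation, reusing the dtw, warping_matrices and cca sketches above (an illustrative simplification: the temporal step here uses plain DTW rather than Gauss-Newton, zero-mean data is assumed, and initialization matters in practice):

```python
import numpy as np

def ctw(X, Y, k, n_iters=20):
    """Canonical Time Warping sketch: alternate between
    (1) temporal alignment W_x, W_y via DTW in the projected space, and
    (2) spatial alignment U, V via CCA on the warped (time-aligned) signals."""
    d_x, n_x = X.shape
    d_y, n_y = Y.shape
    # Start from naive (truncated-identity) spatial projections
    U = np.eye(d_x)[:, :k]
    V = np.eye(d_y)[:, :k]
    for _ in range(n_iters):
        # (1) temporal step: DTW on the projected sequences
        _, path = dtw(U.T @ X, V.T @ Y)
        W_x, W_y = warping_matrices(path, n_x, n_y)
        # (2) spatial step: CCA on the warped sequences (now equal length)
        U, V = cca(X @ W_x, Y @ W_y, k)
    loss = np.sum((U.T @ X @ W_x - V.T @ Y @ W_y) ** 2)
    return U, V, W_x, W_y, loss
```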
Generalized Time Warping

▪ Generalizes to multiple sequences, each possibly of a different modality:

  $L(\{\mathbf{U}_i\}, \{\mathbf{W}_i\}) = \sum_{i} \sum_{j} \big\| \mathbf{U}_i^T \mathbf{X}_i \mathbf{W}_i - \mathbf{U}_j^T \mathbf{X}_j \mathbf{W}_j \big\|_F^2$

▪ $\mathbf{W}_i$ – set of temporal alignments
▪ $\mathbf{U}_i$ – set of cross-modal (spatial) alignments

(1) Time warping   (2) Spatial embedding

[Generalized Canonical Time Warping, Zhou and De la Torre, 2016, TPAMI]
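The pairwise objective translates almost directly into code (a sketch; Xs, Us, Ws are illustrative lists of per-sequence data, projection, and warping matrices):

```python
import numpy as np

def gtw_loss(Xs, Us, Ws):
    """Pairwise Frobenius loss over all projected, time-aligned sequences.
    Xs[i] is d_i x n_i, Us[i] is d_i x k, Ws[i] is n_i x l (warping)."""
    Z = [U.T @ X @ W for X, U, W in zip(Xs, Us, Ws)]   # each k x l
    return sum(np.sum((Zi - Zj) ** 2) for Zi in Z for Zj in Z)
```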


Alignment examples (unimodal)
CMU Motion Capture
Subject 1: 199 frames
Subject 2: 217 frames
Subject 3: 222 frames

Weizmann

Subject 1: 40 frames
Subject 2: 44 frames
Subject 3: 43 frames

Alignment examples (multimodal)
Canonical time warping - limitations

▪ Linear transform between modalities


▪ How to address this?
Deep Canonical Time Warping

  $L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \big\| f_{\boldsymbol{\theta}_1}(\mathbf{X})\mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y})\mathbf{W}_y \big\|_F^2$

▪ Can be seen as a generalization of DCCA and GTW

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]


Deep Canonical Time Warping

  $L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \big\| f_{\boldsymbol{\theta}_1}(\mathbf{X})\mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y})\mathbf{W}_y \big\|_F^2$

▪ The projections are orthogonal (like in DCCA)
▪ Optimization is again iterative:
  ▪ Solve for the alignment ($\mathbf{W}_x, \mathbf{W}_y$) with fixed projections ($\boldsymbol{\theta}_1, \boldsymbol{\theta}_2$)
    ▪ Eigen-decomposition
  ▪ Solve for the projections ($\boldsymbol{\theta}_1, \boldsymbol{\theta}_2$) with fixed alignment ($\mathbf{W}_x, \mathbf{W}_y$)
    ▪ Gradient descent
  ▪ Repeat until convergence

[Deep Canonical Time Warping, Trigeorgis et al., 2016, CVPR]
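A high-level sketch of one such alternation (illustrative PyTorch; f1, f2 and optimizer are assumed encoders and their optimizer, and the dtw/warping_matrices helpers from earlier are reused — the actual DCTW objective also enforces the orthogonality constraints):

```python
import torch

def dctw_step(f1, f2, X, Y, optimizer):
    """One alternation of Deep Canonical Time Warping (sketch).
    X: d_x x n_x tensor, Y: d_y x n_y tensor; f1, f2: feature encoders."""
    # 1) Alignment step: fix the networks, recompute the warping with DTW
    with torch.no_grad():
        Fx = f1(X.T).T   # k x n_x features
        Fy = f2(Y.T).T   # k x n_y features
    _, path = dtw(Fx.cpu().numpy(), Fy.cpu().numpy())
    W_x, W_y = warping_matrices(path, X.shape[1], Y.shape[1])
    W_x = torch.as_tensor(W_x, dtype=X.dtype)
    W_y = torch.as_tensor(W_y, dtype=Y.dtype)

    # 2) Projection step: fix the alignment, update the networks by
    #    gradient descent on the Frobenius objective
    loss = ((f1(X.T).T @ W_x - f2(Y.T).T @ W_y) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```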


Implicit alignment
Implicit alignment

▪ We looked at how to explicitly align temporal data
▪ Could we use that as an internal (hidden) step in our models?
▪ Can we instead encourage the model to align the data when solving a different problem?
▪ Yes!
  ▪ Graphical models
  ▪ Neural attention models (focus of today’s lecture)
Attention models
Attention in humans

▪ Foveal vision – we only see in “high resolution” in 2 degrees of vision
▪ We focus our attention selectively on certain words (for example our names)
▪ We attend to relevant speech in a noisy room
Attention models in deep learning

▪ Many examples of attention models in recent years!
▪ Why:
  ▪ Allows for implicit data alignment
  ▪ Good results empirically
  ▪ In some cases faster (no need to focus on the whole image)
  ▪ Better interpretability
Types of Attention Models

▪ Recent attention models can be roughly split into three major categories
  1. Soft attention
     ▪ Acts like a gate function. Deterministic inference.
  2. Transform networks
     ▪ Warp the input to better align it with a canonical view.
  3. Hard attention
     ▪ Includes stochastic processes. Related to reinforcement learning.
Soft attention
Machine Translation

▪ Given a sentence in one language, translate it to another

  Dog on the beach → le chien sur la plage

▪ Not exactly a multimodal task – but a good start! Each language can be seen almost as a modality.
Machine Translation with RNNs

▪ A quick reminder about encoder-decoder frameworks
▪ First we encode the sentence
▪ Then we decode it in a different language

[Figure: the encoder reads “le chien sur la plage” into a context / embedding / sentence representation, which the decoder expands into “Dog on the beach”]
Machine Translation with RNNs

▪ What is the problem with this?
▪ What happens when the sentences are very long?
▪ We expect the encoder’s hidden state to capture everything in a sentence – a very complex state in a single vector, such as:

  The agreement on the European Economic Area was signed in August 1992.

  L’accord sur la zone économique européenne a été signé en août 1992.
Decoder – attention model

▪ Before, the decoder would just take the final hidden state; now we actually care about the intermediate hidden states

[Figure: encoder hidden states $\mathbf{h}_1, \ldots, \mathbf{h}_5$ over “le chien sur la plage”; an attention module/gate combines them into a context $\mathbf{z}_0$ which, with the decoder hidden state $\mathbf{s}_0$, produces the first output word “Dog”]

[Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015]
Decoder – attention model

▪ The same attention step is repeated at every decoding time step: a new context $\mathbf{z}_1, \mathbf{z}_2, \ldots$ is computed from the encoder states $\mathbf{h}_1, \ldots, \mathbf{h}_5$ and the current decoder state $\mathbf{s}_1, \mathbf{s}_2, \ldots$ to generate each next word (“on”, “the”, …)

[Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015]
How do we encode attention

▪ Before:
  ▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z})$, where $\mathbf{z} = \mathbf{h}_T$ and $\mathbf{s}_i$ is the current state of the decoder
▪ Now:
  ▪ $p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z}_i)$
  ▪ Have an attention “gate”
  ▪ A different context $\mathbf{z}_i$ is used at each time step!
  ▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$
  ▪ $\alpha_{ij}$ – the (scalar) attention for word $j$ at generation step $i$


MT with attention

So how do we determine $\alpha_{ij}$?

▪ $\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$ – a softmax, making sure the weights sum to 1

where:
▪ $e_{ij} = \mathbf{v}^T \sigma(W \mathbf{s}_{i-1} + U \mathbf{h}_j)$ – a feed-forward network that tells us, given the current state of the decoder, how important the current encoding is now
  ▪ $\mathbf{v}, W, U$ – learnable weights

▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$ – the expectation of the context (a fancy way to say it’s a weighted average)
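A minimal NumPy sketch of this additive attention step (names are illustrative; tanh stands in for the generic nonlinearity $\sigma$):

```python
import numpy as np

def attention_context(s_prev, H, W, U, v):
    """One soft-attention step.

    s_prev : previous decoder state, shape (d_s,)
    H      : encoder hidden states, shape (T_x, d_h)
    W, U, v: learnable weights, shapes (d_a, d_s), (d_a, d_h), (d_a,)

    Returns the context vector z_i and the attention weights alpha_i."""
    # Alignment scores e_ij = v^T tanh(W s_{i-1} + U h_j) for every j
    e = np.tanh(W @ s_prev + H @ U.T) @ v           # shape (T_x,)
    # Softmax so the weights sum to 1
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context = expectation (weighted average) of the encoder states
    z = alpha @ H                                    # shape (d_h,)
    return z, alpha
```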
MT with attention

▪ Basically we are using a neural network to tell us where a neural network should be looking!
▪ We can use it with an RNN, LSTM or GRU
▪ The encoder being used has the same structure as before
  ▪ Can be uni-directional
  ▪ Can be bi-directional
▪ The model can be trained using our regular back-propagation through time – all of the modules are differentiable
Does it work?
MT with attention recap

▪ We get good translation results (especially for long sentences)
▪ We also get a (soft) alignment of sentences in different languages
▪ Extra interpretability of how the method is functioning
▪ How do we move to multimodal?
Visual captioning with soft attention

[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015]
Recap: RNN for Captioning

▪ Why might we not want to focus on the final layer?
  ▪ Looking at more fine-grained features

[Figure: caption “Bird in the sky”. A convolutional feature map with L locations and D-dimensional features feeds an RNN (states $s_0, s_1, s_2$). At each step the model outputs a distribution $a_t$ over the L locations and a word distribution $d_t$; the context $z_t$ is the expectation over the features and is combined with the previous word $y_{t-1}$ to generate the next word.]
Soft attention

▪ Allows for latent data alignment


▪ Allows us to get an idea of what the network “sees”
▪ Can be optimized using back propagation

▪ Good at paper naming!


▪ Show, Attend and Tell (extension of Show and Tell)
▪ Listen, Attend and Walk
▪ Listen, Attend and Spell
▪ Ask, Attend and Answer
Spatial Transformer Networks
Some limitations of grid-based attention

▪ Can we fixate on small parts of the image but still have easy end-to-end training?
Spatial Transformer Networks

▪ Can we make this function differentiable?
Spatial Transformer Networks

▪ Idea: a function mapping pixel coordinates $(x^t, y^t)$ of the output to pixel coordinates $(x^s, y^s)$ of the input
▪ Can we make this function differentiable?

  $\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$
Spatial Transformer Networks

▪ Idea: a function mapping pixel coordinates $(x^t, y^t)$ of the output to pixel coordinates $(x^s, y^s)$ of the input
▪ Can we make this function differentiable?

  $\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3} \\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$

▪ The network “attends” to the input by predicting $\theta$
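A minimal NumPy sketch of the sampling this implies: build the output grid, map it through θ, and bilinearly sample the input (an illustration of the mechanism, not the original implementation):

```python
import numpy as np

def spatial_transform(img, theta, out_h, out_w):
    """Apply a 2x3 affine warp `theta` to a grayscale image (H x W)
    by sampling the input at the transformed output-grid coordinates."""
    H, W = img.shape
    # Normalized target coordinates in [-1, 1], one per output pixel
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])  # 3 x N
    # Source coordinates (x^s, y^s) = theta @ (x^t, y^t, 1)
    src = theta @ grid                                                  # 2 x N
    # Map from [-1, 1] back to pixel indices, clamped to the image
    sx = np.clip((src[0] + 1) * (W - 1) / 2, 0, W - 1)
    sy = np.clip((src[1] + 1) * (H - 1) / 2, 0, H - 1)
    # Bilinear sampling (differentiable w.r.t. theta in an autodiff framework)
    x0 = np.clip(np.floor(sx).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, H - 2)
    dx, dy = sx - x0, sy - y0
    out = (img[y0, x0] * (1 - dx) * (1 - dy) +
           img[y0, x0 + 1] * dx * (1 - dy) +
           img[y0 + 1, x0] * (1 - dx) * dy +
           img[y0 + 1, x0 + 1] * dx * dy)
    return out.reshape(out_h, out_w)
```

With `theta = np.array([[1.0, 0, 0], [0, 1.0, 0]])` the output reproduces the input; shrinking the diagonal terms makes the network zoom into a region, which is exactly the “attending” behaviour obtained by predicting θ.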
Examples on real-world data

▪ Results on traffic sign recognition
▪ Code available at http://torch.ch/blog/2015/09/07/spatial_transformers.html
Recap on Spatial Transformer Networks

▪ Differentiable, so we can just use back-prop for training end-to-end
▪ Can use complex models for focusing on an image
  ▪ Affine and piece-wise affine, perspective, thin plate splines
▪ Can be used to focus on certain parts of an image
▪ We can use it instead of grid-based soft and hard attention for multimodal tasks
Glimpse Network
(Hard Attention)
Hard attention

▪ Soft attention requires computing a representation for the whole image or sentence
▪ Hard attention, on the other hand, forces looking only at one part
▪ The main motivation was reduced computational cost rather than improved accuracy (although that happens a bit as well)
▪ A saccade followed by a glimpse – how the human visual system works

[Recurrent Models of Visual Attention, Mnih, 2014]
[Multiple Object Recognition with Visual Attention, Ba, 2015]
Hard attention examples

Glimpse Sensor

▪ Looking at a part of an image at different scales
▪ A number of different scales are combined into a single multichannel image (a human-retina-like representation)
▪ Given a location $l_t$, output an image summary at that location

[Recurrent Models of Visual Attention, Mnih, 2014]
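A rough sketch of such a glimpse sensor (illustrative: concentric crops at doubling scales around $l_t$, each resized to the same resolution and stacked as channels):

```python
import numpy as np

def glimpse_sensor(img, loc, base=8, n_scales=3):
    """Extract a retina-like glimpse: n_scales concentric square crops
    centred at `loc` (row, col), each twice as large as the previous,
    all downsampled to base x base and stacked as channels."""
    H, W = img.shape
    r, c = loc
    patches = []
    for s in range(n_scales):
        half = base * (2 ** s) // 2
        # Crop with clamping at the image borders
        r0, r1 = max(0, r - half), min(H, r + half)
        c0, c1 = max(0, c - half), min(W, c + half)
        crop = img[r0:r1, c0:c1]
        # Naive downsampling to base x base by strided indexing
        rows = np.linspace(0, crop.shape[0] - 1, base).astype(int)
        cols = np.linspace(0, crop.shape[1] - 1, base).astype(int)
        patches.append(crop[np.ix_(rows, cols)])
    return np.stack(patches, axis=-1)   # base x base x n_scales
```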
Glimpse network

▪ Combines the glimpse and the location of the glimpse into a joint network
▪ The glimpse is followed by a feed-forward network (a CNN or a DNN)
▪ The exact formulation of how the location and appearance are combined varies; the important thing is combining what and where
▪ Differentiable with respect to the glimpse parameters, but not the location
Overall Architecture – Emission network

▪ Given an image, a glimpse location $l_t$, and optionally an action $a_t$
▪ The action can be:
  ▪ Some action in a dynamic system – press a button, etc.
  ▪ Classification of an object
  ▪ Word output
▪ This is an RNN with two output gates and a slightly more complex input gate!
Recurrent Model of Visual Attention (RAM)

▪ Sample locations of glimpses, leading to updates in the network
▪ Use gradient descent to update the weights (the glimpse network weights are differentiable)
▪ The emission network is an RNN
▪ Not as simple as backprop, but doable
▪ Turns out this is very similar, and in some cases equivalent, to reinforcement learning using the REINFORCE learning rule [Williams, 1992]
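For intuition, a sketch of the REINFORCE-style update for the stochastic location choice (assuming a Gaussian location policy around the emitted mean; all names here are illustrative, and in practice a learned baseline is used to reduce variance):

```python
import numpy as np

def reinforce_update(mus, locs, reward, baseline, sigma=0.1, lr=1e-3):
    """REINFORCE-style update for the glimpse-location policy (sketch).

    The emission network outputs a mean mu_t; the actual glimpse location
    l_t ~ N(mu_t, sigma^2 I) is sampled. The non-differentiable sampling is
    handled by weighting grad log pi(l_t | mu_t) with (reward - baseline).

    mus, locs : lists of 2-d arrays (predicted means and sampled locations)
    reward    : scalar episode reward (e.g. 1 if the final classification is correct)
    Returns the update for each mu_t (to be backpropagated further into the
    emission network in a full implementation)."""
    advantage = reward - baseline
    # d/d mu_t of log N(l_t; mu_t, sigma^2 I) = (l_t - mu_t) / sigma^2
    return [lr * advantage * (l - m) / sigma ** 2 for m, l in zip(mus, locs)]
```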
Multimodal alignment recap
Multimodal alignment recap

▪ Explicit alignment – aligns two or more modalities (or views) as an actual task. The goal is to find correspondences between modalities
  ▪ Dynamic Time Warping
  ▪ Canonical Time Warping
  ▪ Deep Canonical Time Warping
▪ Implicit alignment – uses internal latent alignment of modalities in order to better solve various problems
  ▪ Attention models
    ▪ Soft attention
    ▪ Spatial transformer networks
    ▪ Hard attention
