Lecture 8.1: Multimodal Alignment
Upcoming Schedule
▪ Main sections:
▪ Abstract
▪ Introduction
▪ Related work
▪ Problem statement
▪ Multimodal baseline models
▪ Experimental methodology
▪ Results and discussion
▪ Proposed approaches
Lecture objectives
▪ Multimodal alignment
▪ Implicit
▪ Explicit
▪ Explicit signal alignment
▪ Dynamic Time Warping
▪ Canonical Time Warping
▪ Attention models in deep learning (implicit and explicit alignment)
▪ Soft attention
▪ Hard attention
▪ Spatial Transformer Networks
Multimodal alignment
Multimodal alignment

▪ Two types:
▪ Explicit – alignment is the task in itself
▪ Latent – alignment helps when solving a different task (for example, "Attention" models)

[Figure: a "fancy algorithm" aligning two sequences across time steps $t_1, \dots, t_n$, e.g. phrases/words of translated sentences]
Explicit multimodal alignment

Applications:
- Re-aligning asynchronous data
- Finding similar data across modalities (we can estimate the alignment cost)
- Event reconstruction from multiple sources
Let’s start unimodal – Dynamic Time Warping
▪ Where $\mathbf{p}^x$ and $\mathbf{p}^y$ are index vectors of the same length
▪ Finding these indices is called Dynamic Time Warping
Dynamic Time Warping continued

[Figure: the DTW alignment path through the cost grid, from $(\mathbf{p}_1^x, \mathbf{p}_1^y)$ to $(\mathbf{p}_t^x, \mathbf{p}_t^y)$]
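As a concrete companion to the objective, here is a minimal NumPy sketch of the classic dynamic-programming solution (the function name and step pattern are illustrative assumptions, not the lecture's code):

```python
import numpy as np

def dtw(x, y):
    """Align sequences x (n, d) and y (m, d); return cost and index vectors."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)  # accumulated cost table
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((x[i - 1] - y[j - 1]) ** 2)  # squared Euclidean distance
            # step pattern: match both frames, or replicate a frame of x or y
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # backtrack to recover the index vectors p_x, p_y
    p_x, p_y = [n - 1], [m - 1]
    i, j = n, m
    while i > 1 or j > 1:
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
        p_x.append(i - 1)
        p_y.append(j - 1)
    return D[n, m], p_x[::-1], p_y[::-1]
```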
DTW alternative formulation

$$L(\mathbf{p}^x, \mathbf{p}^y) = \sum_{t=1}^{l} \left\| \mathbf{x}_{\mathbf{p}_t^x} - \mathbf{y}_{\mathbf{p}_t^y} \right\|_2^2$$

Replication doesn't change the objective! Writing the replicated sequences as $\mathbf{X}\mathbf{W}_x$ and $\mathbf{Y}\mathbf{W}_y$ gives the alternative objective:

$$L(\mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{X}\mathbf{W}_x - \mathbf{Y}\mathbf{W}_y \right\|_F^2$$

$\mathbf{X}, \mathbf{Y}$ – original signals (same #rows, possibly different #columns)
$\mathbf{W}_x, \mathbf{W}_y$ – alignment matrices
Frobenius norm: $\|\mathbf{A}\|_F^2 = \sum_i \sum_j a_{i,j}^2$
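To see why replication doesn't change the objective, a small sketch (reusing the `dtw()` helper above; names are illustrative) that encodes the index vectors as binary replication matrices and checks the two objectives agree:

```python
import numpy as np

def replication_matrix(p, n):
    """n x l binary matrix whose t-th column selects frame p[t] of a signal."""
    W = np.zeros((n, len(p)))
    W[np.asarray(p), np.arange(len(p))] = 1.0
    return W

# demo: X, Y are d x n and d x m, as in the slides
x, y = np.random.randn(6, 2), np.random.randn(9, 2)
cost, p_x, p_y = dtw(x, y)
X, Y = x.T, y.T
W_x = replication_matrix(p_x, X.shape[1])
W_y = replication_matrix(p_y, Y.shape[1])
assert np.isclose(cost, np.sum((X @ W_x - Y @ W_y) ** 2))  # same objective
```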
DTW - limitations

▪ Computationally complex, especially when extended to m sequences
▪ Sensitive to outliers
▪ Unimodal!
Canonical Correlation Analysis reminder

maximize: $\mathrm{tr}(\mathbf{U}^T \mathbf{\Sigma}_{XY} \mathbf{V})$

1. Linear projections maximizing correlation
2. Orthogonal projections

[Figure: X and Y projected into the correlated spaces $\mathbf{H}_x$ (projection of X) and $\mathbf{H}_y$ (projection of Y)]
Canonical Correlation Analysis reminder

$$L(\mathbf{U}, \mathbf{V}) = \left\| \mathbf{U}^T \mathbf{X} - \mathbf{V}^T \mathbf{Y} \right\|_F^2$$

[Figure: Text ($\mathbf{X}$) and Image ($\mathbf{Y}$) modalities projected through $\mathbf{U}$ and $\mathbf{V}$ into $\mathbf{H}_x$ and $\mathbf{H}_y$]
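A compact sketch of linear CCA via whitening and an SVD (one standard recipe; the regularization constant and names are assumptions, not the lecture's code):

```python
import numpy as np

def cca(X, Y, k, eps=1e-8):
    """X: (dx, n), Y: (dy, n) with zero-mean columns; k projection dims."""
    n = X.shape[1]
    Sxx = X @ X.T / n + eps * np.eye(X.shape[0])  # regularized covariances
    Syy = Y @ Y.T / n + eps * np.eye(Y.shape[0])
    Sxy = X @ Y.T / n

    def inv_sqrt(S):
        # inverse matrix square root via eigendecomposition (S is symmetric PD)
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    A, s, Bt = np.linalg.svd(Wx @ Sxy @ Wy)  # singular values = correlations
    U = Wx @ A[:, :k]     # projection of X: H_x = U.T @ X
    V = Wy @ Bt.T[:, :k]  # projection of Y: H_y = V.T @ Y
    return U, V, s[:k]
```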
Canonical Time Warping

$$L(\mathbf{U}, \mathbf{V}, \mathbf{W}_x, \mathbf{W}_y) = \left\| \mathbf{U}^T \mathbf{X}\mathbf{W}_x - \mathbf{V}^T \mathbf{Y}\mathbf{W}_y \right\|_F^2$$

▪ Optimized by coordinate descent: alternate between the alignment matrices $\mathbf{W}_x, \mathbf{W}_y$ and the projections $\mathbf{U}, \mathbf{V}$ (Gauss-Newton)

[Canonical Time Warping for Alignment of Human Behavior, Zhou and De la Torre, 2009, NIPS]
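A sketch of that alternation, reusing the `dtw()` and `cca()` sketches above (the initialization and fixed iteration count are assumptions; the paper's stopping criterion differs):

```python
import numpy as np

def ctw(X, Y, k, iters=20):
    """X: (dx, n), Y: (dy, m); returns projections and index vectors."""
    # initialize with a uniform (linearly interpolated) alignment
    l = max(X.shape[1], Y.shape[1])
    p_x = np.linspace(0, X.shape[1] - 1, l).astype(int)
    p_y = np.linspace(0, Y.shape[1] - 1, l).astype(int)
    for _ in range(iters):
        # (1) fix the alignment, solve for U, V with CCA on aligned frames
        Xa, Ya = X[:, p_x], Y[:, p_y]
        U, V, _ = cca(Xa - Xa.mean(1, keepdims=True),
                      Ya - Ya.mean(1, keepdims=True), k)
        # (2) fix U, V, re-align the projected signals with DTW
        _, p_x, p_y = dtw((U.T @ X).T, (V.T @ Y).T)
        p_x, p_y = np.asarray(p_x), np.asarray(p_y)
    return U, V, p_x, p_y
```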
Generalized Time Warping

[Figure: aligning the same action performed by three people in the Weizmann dataset – Subject 1: 40 frames, Subject 2: 44 frames, Subject 3: 43 frames]
Alignment examples (multimodal)
Canonical time warping - limitations

▪ CTW is restricted to linear projections of the signals; a deep extension replaces them with nonlinear mappings $f_{\boldsymbol{\theta}_1}, f_{\boldsymbol{\theta}_2}$ (e.g., neural networks):

$$L(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \mathbf{W}_x, \mathbf{W}_y) = \left\| f_{\boldsymbol{\theta}_1}(\mathbf{X})\mathbf{W}_x - f_{\boldsymbol{\theta}_2}(\mathbf{Y})\mathbf{W}_y \right\|_F^2$$
Machine Translation with RNNs

[Figure: an Encoder maps "le chien sur la plage" to a single context / embedding / sentence representation, which the Decoder then decodes into the target sentence]
▪ Before, the encoder would just take the final hidden state; now we actually care about the intermediate hidden states

[Figure: while generating "Dog on the ...", a different context $\mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2$ is computed over the encoder hidden states $\mathbf{h}_1, \dots, \mathbf{h}_5$ at each decoding step]
▪ Before:
▪ $p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z})$, where $\mathbf{z} = \mathbf{h}_T$, and $\mathbf{s}_i$ is the current state of the decoder
▪ Now:
▪ $p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, \mathbf{s}_i, \mathbf{z}_i)$
▪ Have an attention "gate"
▪ A different context $\mathbf{z}_i$ used at each time step!
▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$
Where:
▪ $e_{ij} = \mathbf{v}^T \sigma(W\mathbf{s}_{i-1} + U\mathbf{h}_j)$ – a feedforward network that can tell us, given the current state of the decoder, how important the current encoding is now
▪ $\mathbf{v}, W, U$ – learnable weights
▪ $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$ – the scores normalized with a softmax
▪ $\mathbf{z}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$ – expectation of the context (a fancy way to say it's a weighted average)
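A minimal NumPy sketch of this additive-attention step (shapes and names are assumed for illustration, not the lecture's code):

```python
import numpy as np

def attention_context(s_prev, H, v, W, U):
    """s_prev: (ds,) decoder state; H: (Tx, dh) encoder states -> z_i, alphas."""
    e = np.array([v @ np.tanh(W @ s_prev + U @ h) for h in H])  # scores e_ij
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()          # softmax over source positions
    z = alpha @ H                 # weighted average = expected context z_i
    return z, alpha
```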
MT with attention

[Figure: unrolled decoder – each state $s_0, s_1, s_2$ produces a distribution over L locations ($a_1, a_2, a_3$) and an output word ($d_1, d_2$); the contexts $z_1, z_2$ and previous words $y_0, y_1$ feed the next state]
Soft attention
Spatial Transformer Networks

▪ Network "attends" to the input by predicting transformation parameters $\theta$
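A hedged PyTorch sketch of the idea: a small localisation network predicts $\theta$, and the built-in `F.affine_grid` / `F.grid_sample` warp the input accordingly (the tiny localisation net is an illustrative assumption, not the paper's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self, in_ch, h, w):
        super().__init__()
        # localisation network: predicts the 2x3 affine matrix theta
        self.loc = nn.Sequential(
            nn.Flatten(), nn.Linear(in_ch * h * w, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # initialize to the identity transform so training starts stable
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)  # per-image predicted theta
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # warped input
```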
Examples on real-world data

▪ Combining the glimpse and the location of the glimpse into a joint network, as sketched below
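A minimal sketch of what such a joint ("glimpse") network could look like (layer sizes and names are assumptions for illustration):

```python
import torch
import torch.nn as nn

class GlimpseNetwork(nn.Module):
    def __init__(self, glimpse_dim, hidden=128):
        super().__init__()
        self.what = nn.Sequential(nn.Linear(glimpse_dim, hidden), nn.ReLU())
        self.where = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())  # (x, y) location
        self.joint = nn.Linear(hidden * 2, hidden)

    def forward(self, glimpse, loc):
        g = self.what(glimpse.flatten(1))  # encode the glimpse content
        l = self.where(loc)                # encode where it was taken
        return torch.relu(self.joint(torch.cat([g, l], dim=1)))  # joint feature
```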