Deep Cross-Modal Audio-Visual Generation

Lele Chen∗
Computer Science, University of Rochester
lchen63@cs.rochester.edu

Sudhanshu Srivastava∗
Computer Science, University of Rochester
ssrivas6@cs.rochester.edu

Zhiyao Duan
Electrical and Computer Engineering, University of Rochester
zhiyao.duan@rochester.edu

Chenliang Xu
Computer Science, University of Rochester
chenliang.xu@rochester.edu

∗These authors contributed equally to this work.
arXiv:1704.08292v1 [cs.CV] 26 Apr 2017

ABSTRACT
Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite works in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluations demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.

KEYWORDS
cross-modal generation, audio-visual, generative adversarial networks

1 INTRODUCTION
Cross-modal perception, or intersensory phenomenon, has been a long-lasting research topic in numerous disciplines such as psychology [3, 28, 30, 31], neurology [27], and human-computer interaction [14, 29], and has recently gained attention in computer vision [17], audition [10] and multimedia analysis [6, 19]. In this paper, we focus on the problem of cross-modal audio-visual generation. Our system is trained with pairs of visual and audio signals, which are typically contained in videos, and is able to generate one modality (visual/audio) given observations from the other modality (audio/visual). Fig. 1 shows results generated by our system on a musical performance video dataset.
Learning from multimodal input is challenging: despite the many works in cross-modal analysis, a large portion of the effort, e.g., [6, 19, 21, 32], has been focused on indexing and retrieval instead of generation. Although joint representations of multiple modalities and their correlations are explored, these methods only need to retrieve samples that exist in a database. They do not, for example, need to model the details of the samples, which is required in data generation. On the contrary, the generation task requires generating novel images and sounds that are unseen or unheard, and is of great interest to many applications, such as creating artworks [8, 33] and zero-shot learning [2]. It requires learning a complex generative function that produces meaningful outputs. In the case of cross-modality generation, this function has to map from one modality space to the other modality space, making the problem even more challenging and interesting.
Generative Adversarial Networks (GANs) [7] have become an emerging topic in deep generative models. Inspired by Reed et al.'s work on generating images conditioned on text captions [23], we design conditional GANs for cross-modal audio-visual generation. Different from their work, we make the networks handle intersensory generation: generating images conditioned on sounds and generating sounds conditioned on images. We explore two different tasks when generating images: instrument-oriented generation (see Fig. 1) and pose-oriented generation (see Fig. 10), where the latter task is treated as fine-grained generation compared to the former.
Another key aspect to the success of cross-modal generation is being able to effectively encode and decode the information contained in different modalities. For images, Convolutional Neural Networks (CNNs) are known to perform well in various tasks. Therefore, we train a CNN and use the fully connected layer before softmax as the image encoder, and use several deconvolution layers as the decoder/generator. For sounds, we also use CNNs to encode and decode. The input to the networks, however, cannot be the raw waveforms. Instead, we first transform the time-domain signal into the time-frequency or time-quefrency domain. We explore five different transformations and find that the log-mel spectrogram gives the best result.
To explore this new problem space, we compose two datasets, i.e., Sub-URMP and INIS. The Sub-URMP dataset consists of paired images and sounds extracted from 107 single-instrument musical performance videos of 13 kinds of instruments in the University of Rochester Musical Performance (URMP) dataset [11]. In total, 17,555 images are extracted and each image is paired with a half-second-long sound clip. The INIS dataset contains ImageNet [4] images of five music instruments, i.e., drum, saxophone, piano, guitar and violin. We pair each image with a short sound clip of a solo performance of the corresponding instrument. We conduct experiments to evaluate the quality of our generated images and sound spectrograms using both classification and human evaluation. Our experiments demonstrate that our conditional GANs can, indeed, generate one modality (visual/audio) from the other modality (audio/visual) to a good extent at both the instrument-level and the pose-level. We also compare and evaluate various design choices in our experiments.
Figure 1: Generated outputs using our cross-modal audio-visual generation models. Top three rows are musical performance
images generated by our Sound-to-Image (S2I) networks from audio recordings. S2I-C is our main model. S2I-A and S2I-N are
variations of our main model. Bottom row contains the log-mel spectrograms of generated audio of different instruments
from musical performance images using our Image-to-Sound (I2S) network. Each column represents one instrument type.

The contributions are three-fold. First, to our best knowledge, we introduce the problem of cross-modal audio-visual generation and are the first to use GANs on intersensory generation. Second, we propose new network structures and adversarial training strategies for cross-modal GANs. Third, we compose two datasets that will be released to facilitate future research in this new problem space.
The paper is organized as follows. We discuss related work and background in Sec. 2. We introduce our network structure, training strategies and encoding methods in Sec. 3. We present our datasets in Sec. 4 and experiments in Sec. 5. Finally, we conclude our paper in Sec. 6.

2 RELATED WORK
Our work differs from the various works in cross-modal retrieval [6, 19, 21, 32] as stated in Sec. 1. In this section, we further distinguish our work from those in multimodal representation learning. Ngiam et al. [16] learn a shared representation between audio-visual modalities by training a stacked multimodal autoencoder. Srivastava and Salakhutdinov [26] propose a multimodal deep Boltzmann machine to learn a joint representation of images and their text tags. Kumar et al. [9] learn an audio-visual bimodal compositional model using sparse coding. Our work differs from them by using the adversarial training framework, which allows us to learn a much deeper representation for the generator.
Adversarial training has recently received a significant amount of attention [1, 5, 7, 13, 20, 23, 24]. It has been shown to be effective in various tasks, such as generating semantic segmentations [12, 25], improving object localization [1], image-to-image translation [8] and enhancing speech [18]. We also use adversarial training, but on a novel problem of cross-modal audio-visual generation with music instruments and human poses that differs from other works.

2.1 Background
Generative Adversarial Networks (GANs) are introduced in the seminal work of Goodfellow et al. [7], and consist of a generator network G and a discriminator network D. Given a distribution, G is trained to generate samples that resemble this distribution, while D is trained to distinguish whether a sample is genuine. They are trained in an adversarial fashion, playing a min-max game against each other:

    min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 - D(G(z)))],    (1)

where p_data is the target data distribution and z is drawn from a random noise distribution p_z.
Conditional GANs [5, 15] are variants of GANs, where one is interested in directing the generation conditioned on some variables, e.g., labels in a dataset. They have the following form:

    min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x|y)] + E_{z ~ p_z(z)}[log(1 - D(G(z|y)))],    (2)

where the only difference from GANs is the introduction of y, which represents the condition variable. This condition is passed to both the generator and the discriminator networks. One particular example is [23], where conditional GANs are used to generate images conditioned on text captions. The text captions are encoded through a recurrent neural network as in [22]. In this paper, we use conditional GANs for cross-modal audio-visual generation.
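As a concrete illustration of the conditional objective in Eq. (2), the following sketch shows one adversarial update in PyTorch. It is a minimal example under assumed interfaces: netG, netD, the optimizers, and the shapes of x_real and the condition y_cond are hypothetical placeholders, not the authors' released implementation. The generator here maximizes log D(G(z|y)), the non-saturating form that the paper also adopts in Sec. 3.2.

```python
import torch

def cgan_step(netG, netD, optG, optD, x_real, y_cond, z_dim=100):
    """One conditional-GAN update following Eq. (2); netG/netD are assumed modules."""
    b = x_real.size(0)
    z = torch.randn(b, z_dim)
    eps = 1e-8

    # Discriminator: maximize log D(x|y) + log(1 - D(G(z|y)))
    x_fake = netG(z, y_cond).detach()
    d_real = netD(x_real, y_cond)            # scores in (0, 1)
    d_fake = netD(x_fake, y_cond)
    d_loss = -(torch.log(d_real + eps).mean()
               + torch.log(1.0 - d_fake + eps).mean())
    optD.zero_grad(); d_loss.backward(); optD.step()

    # Generator: maximize log D(G(z|y)) (non-saturating form)
    x_fake = netG(z, y_cond)
    g_loss = -torch.log(netD(x_fake, y_cond) + eps).mean()
    optG.zero_grad(); g_loss.backward(); optG.step()
    return d_loss.item(), g_loss.item()
```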
3 CROSS-MODAL GENERATION MODEL
The overall diagram of our model is shown in Fig. 2, where we have separate networks for Sound-to-Image (S2I) and Image-to-Sound (I2S) generation.
Figure 2: The overall diagram of our model. This figure consists of an S2I GAN network (a) and an I2S GAN network (b). Each network contains an encoder, a generator, and a discriminator.

Each of them consists of three parts: an encoder network, a generator network, and a discriminator network. We describe the generator and discriminator networks in Sec. 3.1, and their training strategies in Sec. 3.2. We present the encoder networks for sound and image in Sec. 3.3 and Sec. 3.4, respectively.

3.1 Generator and Discriminator Networks
S2I Generator. The S2I generator network is denoted as G_{S↦I}: R^{|φ(A)|} × R^Z ↦ R^I. The sound encoding vector of size 128 is first compressed to a vector of size 64 via a fully connected layer followed by a leaky ReLU, which is denoted as φ(A). Then it is concatenated with a random noise vector z ∈ R^Z. The generator takes this concatenated vector and produces a synthetic image x̂_I ← G_{S↦I}(z, φ(A)) of size 64x64x3.
S2I Discriminator. The S2I discriminator network is denoted as D_{S↦I}: R^I × R^{|φ(A)|} ↦ [0, 1]. It takes an image and a compressed sound encoding vector and produces a score for this pair being a genuine pair of image and sound.
I2S Generator. Similarly, the I2S generator network is denoted as G_{I↦S}: R^{|ϕ(I)|} × R^Z ↦ R^A. The image encoding vector of size 128 is compressed to size 64 via a fully connected layer followed by a leaky ReLU, denoted as ϕ(I), and concatenated with a noise vector z. The generator takes this vector and does a forward pass to produce a synthetic sound spectrogram x̂_A ← G_{I↦S}(z, ϕ(I)) of size 128x34.
I2S Discriminator. The I2S discriminator network is denoted as D_{I↦S}: R^A × R^{|ϕ(I)|} ↦ [0, 1]. It takes a sound spectrogram and a compressed image encoding vector and produces a score for this pair being a genuine pair of sound and image.
Our implementation is based on the GAN-CLS by Reed et al. [23]. We extend it to handle the challenges of operating on sound spectrograms, which have a rectangular size. For the I2S generator network, after obtaining a 32x32x128 feature map, we apply two successive deconvolution layers, where each has a kernel of size 4x4 with stride 2x1 and 1x1 zero-padding, and obtain a matrix of size 128x34. We apply the numpy resize function to get a matrix of size 128x44 for comparison with the ground-truth spectrogram in evaluation. The I2S discriminator network takes a sound spectrogram of size 128x34. To handle the ground-truth spectrogram, we resize it from 128x44 to 128x34. We apply two successive convolution layers, where each has a kernel of size 4x4 with stride 2x1 and 1x1 zero-padding. This results in a 32x32 square feature map. In practice, we have observed that adding more convolution layers in the I2S networks helps get better output in fewer epochs. We add two layers to the generator network and 12 layers to the discriminator network.
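The shape bookkeeping above can be summarized in code. The PyTorch sketch below shows only the interface of the S2I pair: the 128-d sound encoding is compressed to 64-d with a fully connected layer and a leaky ReLU (φ(A)), concatenated with the noise z, and decoded to a 64x64x3 image; the discriminator scores an (image, φ(A)) pair. Layer counts, channel widths, and the exact GAN-CLS upsampling stack are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class S2IGenerator(nn.Module):
    """Sound encoding (128-d) + noise z -> 64x64x3 image (shapes from Sec. 3.1)."""
    def __init__(self, z_dim=100, enc_dim=128):
        super().__init__()
        self.compress = nn.Sequential(nn.Linear(enc_dim, 64), nn.LeakyReLU(0.2))  # phi(A)
        self.fc = nn.Linear(z_dim + 64, 128 * 8 * 8)
        self.deconv = nn.Sequential(                       # assumed 8 -> 16 -> 32 -> 64 upsampling
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, z, sound_enc):
        h = torch.cat([z, self.compress(sound_enc)], dim=1)
        h = self.fc(h).view(-1, 128, 8, 8)
        return self.deconv(h)                              # (B, 3, 64, 64)

class S2IDiscriminator(nn.Module):
    """Scores an (image, compressed sound encoding) pair in [0, 1]."""
    def __init__(self, enc_dim=128):
        super().__init__()
        self.compress = nn.Sequential(nn.Linear(enc_dim, 64), nn.LeakyReLU(0.2))
        self.conv = nn.Sequential(                         # assumed 64 -> 32 -> 16 -> 8 -> 4 downsampling
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        # replicate the 64-d condition spatially and fuse (512+64 channels, as indicated in Fig. 2)
        self.out = nn.Sequential(nn.Conv2d(512 + 64, 1, 4), nn.Sigmoid())

    def forward(self, image, sound_enc):
        h = self.conv(image)                               # (B, 512, 4, 4)
        c = self.compress(sound_enc).view(-1, 64, 1, 1).expand(-1, 64, 4, 4)
        return self.out(torch.cat([h, c], dim=1)).view(-1)  # (B,)
```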
3.2 Adversarial Training Strategies
Without loss of generality, we assume that the training set contains pairs of images and sounds {(I_i^j, A_i^j)}, where I_i^j represents the jth image of the ith instrument category in our dataset and A_i^j represents the corresponding sound. Here, i ∈ {1, 2, 3, ..., 13} represents the index to one of the music instruments in our dataset, e.g., cello or violin. Notice that even images and sounds within the same music instrument category differ in terms of the player, pose, and music note.
We use I_{-i} to represent the set of all images of instruments of all the categories except the ith category, and use I_i^{-j} to represent the set of all images in the ith instrument category except the jth image. The sound counterparts, A_{-i} and A_i^{-j}, are defined likewise.
Based on the input, we define three kinds of discriminator outputs: S_r, S_f and S_w. Here, S_r is the score for a true pair of image and sound that is contained in our training set, S_f is the score for a pair where one modality is generated based on the other modality, and S_w is the score for a wrong pair of image and sound. Wrong pairs are sampled from the training dataset. The generator network is trained to maximize:

    log(S_f),    (3)

and the discriminator is trained to maximize:

    log(S_r) + (log(1 - S_w) + log(1 - S_f))/2.    (4)

Notice that by using different types of wrong pairs, we can eventually guide the generator in solving various tasks.
S2I Generation (Instrument-Oriented). We train a single S2I model over the entire dataset so that it can generate musical performance images of different instruments from different input sounds. In other words, the same model can generate an image of a person playing violin from an unheard sound of violin, and can generate an image of a person playing saxophone from an unheard sound of saxophone. We apply the following training settings:

    x̂_I ← G_{S↦I}(φ(A_i^j), z)
    S_f = D_{S↦I}(x̂_I, φ(A_i^j))
    S_r = D_{S↦I}(I_i^j, φ(A_i^j))
    S_w = D_{S↦I}(ω(I_{-i}), φ(A_i^j)),    (5)

where x̂_I is the synthetic image of size 64x64x3, z is the random noise vector and φ(A_i^j) is the compressed sound encoding. ω(·) is a random sampler with a uniform distribution, and it samples images from the wrong instrument categories to construct wrong pairs for calculating S_w. We use the sound-to-image network structure as in Fig. 2 (a).
S2I Generation (Pose-Oriented). We train a set of S2I models with one for each music instrument category. Each model captures the relations between different human poses and input sounds of one instrument. For example, the model trained on violin image-sound pairs can generate a series of images of a person playing violin with different hand movements according to different violin sounds. This is a fine-grained generation task compared to the previous instrument-oriented task. We apply the following training settings:

    x̂_I ← G_{S↦I}(φ(A_i^j), z)
    S_f = D_{S↦I}(x̂_I, φ(A_i^j))
    S_r = D_{S↦I}(I_i^j, φ(A_i^j))
    S_w = D_{S↦I}(ω(I_i^{-j}), φ(A_i^j)),    (6)

where the main difference from Eq. (5) is that here, in constructing the wrong pairs, we sample images from wrong images in the correct instrument category, I_i^{-j}, instead of images in wrong instrument categories, I_{-i}. Again, we use the network structure as in Fig. 2 (a).
I2S Generation. We train a single I2S model over the entire dataset so that it can generate sound magnitude spectrograms of different instruments from different musical performance images. In other words, the same model can, for example, generate a sound spectrogram of drum given an image that has a person playing drum. The generator should not make mistakes on the type of instrument while generating spectrograms that can be converted to realistic sounds. In this case, we set the training as follows:

    x̂_A ← G_{I↦S}(ϕ(I_i^j), z)
    S_f = D_{I↦S}(x̂_A, ϕ(I_i^j))
    S_r = D_{I↦S}(A_i^j, ϕ(I_i^j))
    S_w = D_{I↦S}(ω(A_{-i}), ϕ(I_i^j)).    (7)

Recall that x̂_A is the generated sound spectrogram with size 128x34, and ϕ(I_i^j) is the compressed image encoding. We use the image-to-sound network as in Fig. 2 (b).
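The three scores and the objectives in Eqs. (3)-(5) translate into a short training step. The sketch below is a PyTorch illustration under assumed helpers: sample_wrong_images stands in for the uniform sampler ω(·), and netG/netD for an S2I generator and discriminator; it is not the authors' training code.

```python
import torch

def s2i_step(netG, netD, optG, optD, images, sound_enc, labels,
             sample_wrong_images, z_dim=100):
    """One instrument-oriented S2I update following Eqs. (3)-(5)."""
    b = images.size(0)
    z = torch.randn(b, z_dim)
    eps = 1e-8

    x_fake = netG(z, sound_enc)                      # synthetic image x̂_I
    s_f = netD(x_fake.detach(), sound_enc)           # fake pair score S_f
    s_r = netD(images, sound_enc)                    # real pair score S_r
    wrong = sample_wrong_images(labels)              # omega(I_{-i}): images of other instruments
    s_w = netD(wrong, sound_enc)                     # wrong pair score S_w

    # Discriminator maximizes Eq. (4): log S_r + (log(1 - S_w) + log(1 - S_f)) / 2
    d_loss = -(torch.log(s_r + eps).mean()
               + 0.5 * (torch.log(1 - s_w + eps).mean()
                        + torch.log(1 - s_f + eps).mean()))
    optD.zero_grad(); d_loss.backward(); optD.step()

    # Generator maximizes Eq. (3): log S_f
    g_loss = -torch.log(netD(x_fake, sound_enc) + eps).mean()
    optG.zero_grad(); g_loss.backward(); optG.step()
    return d_loss.item(), g_loss.item()
```

Swapping sample_wrong_images to draw from the same instrument but a different image (I_i^{-j}) gives the pose-oriented setting of Eq. (6), and replacing images with spectrograms gives the I2S setting of Eq. (7).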
3.3 Sound Encoder Network
The sound files are sampled at 44,100 Hz. To encode sound, we first transform the raw audio waveform into the time-frequency or time-quefrency domain. We explore a set of representations including the Short-Time Fourier Transform (STFT), Constant-Q Transform (CQT), Mel-Frequency Cepstral Coefficients (MFCC), Mel-Spectrum (MS) and Log-amplitude of Mel-Spectrum (LMS). Figure 3 shows images of the above-mentioned representations for the same sound. We can see that LMS shows clearer patterns than other representations.

Figure 3: Different representations of audio that are fed to the encoder network. The horizontal axis is time and the vertical axis is amplitude (for Wave), frequency (for STFT, MS, CQT, and LMS) or quefrency (for MFCC).
We further run a CNN-based classifier on these different representations. We use four convolutional layers and three fully connected layers (see Fig. 4). In order to prevent overfitting, we add penalties (l2 = 0.015) on layer parameters in the fully connected layers, and we apply dropout (0.7 and 0.8, respectively) to the last two layers. The classification accuracies obtained by different representations are shown in Table 1. We can see that LMS shows the highest accuracy. Therefore, we choose LMS over other representations as the input to the audio encoder network. Furthermore, LMS is smaller in size as compared to STFT, which saves running time. Finally, we feed the output of the FC layer (size: 1x128) of the CNN classifier to the GAN network as the audio feature.

    Accuracy   MS       LMS      CQT      MFCC     STFT
    3 layers   62.01%   84.12%   73.00%   80.06%   74.05%
    4 layers   66.09%   87.44%   77.78%   81.05%   75.73%

Table 1: Accuracy of the audio classifier. We apply three Conv layers and four Conv layers respectively; the best performance is obtained with four Conv layers.

Figure 4: Audio classifier trained with instrument category loss.
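A sketch of this audio classifier, reconstructed from the layer list shown in Fig. 4 (four 3x3 convolution blocks with 4, 8, 16, and 16 kernels, followed by fully connected layers of sizes 1024, 128, and 13 with dropout 0.7/0.8), is given below in PyTorch. Pooling and stride choices are assumptions, and the l2 penalties on the fully connected layers are omitted here (they would typically be added via the optimizer's weight decay or an explicit penalty term), so treat this as an approximation of the figure, not the released model.

```python
import torch
import torch.nn as nn

class AudioClassifier(nn.Module):
    """LMS spectrogram -> 13 instrument classes; the 128-d FC output is the sound feature."""
    def __init__(self, n_classes=13):
        super().__init__()
        self.features = nn.Sequential(                 # kernel counts follow Fig. 4: 4, 8, 16, 16
            nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(),
            nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((16, 16)))            # assumed pooling; not specified in the text
        self.fc1 = nn.Sequential(nn.Flatten(),
                                 nn.Linear(16 * 16 * 16, 1024), nn.ReLU(), nn.Dropout(0.7))
        self.fc2 = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(), nn.Dropout(0.8))
        self.out = nn.Linear(128, n_classes)           # softmax applied inside the cross-entropy loss

    def forward(self, lms, return_encoding=False):
        h = self.fc2(self.fc1(self.features(lms)))     # 128-d feature, later compressed to phi(A) in the GAN
        return h if return_encoding else self.out(h)

# Usage: logits = AudioClassifier()(torch.randn(8, 1, 128, 44))  # batch of 128x44 LMS inputs
```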
A further merit of LMS is detailed in the experiment section. We thus choose LMS to represent the audio. To calculate LMS, a Short-Time Fourier Transform (STFT) with a 2048-point FFT window and a 512-point hop size is first applied to the waveform to get the linear-amplitude, linear-frequency spectrogram. Then a mel-filter bank is applied to warp the frequency scale into the mel-scale, and the linear amplitude is converted to the logarithmic scale as well.
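This LMS pipeline maps directly onto standard audio tooling. The snippet below uses librosa to reproduce the described steps (2048-point FFT, 512-sample hop, mel warping, log amplitude); setting the number of mel bands to 128 matches the 128xT spectrograms used elsewhere in the paper, which is inferred from those sizes rather than a stated parameter.

```python
import librosa
import numpy as np

def compute_lms(wav_path, sr=44100, n_fft=2048, hop_length=512, n_mels=128):
    """Log-amplitude mel spectrogram (LMS) of a waveform, as described in Sec. 3.3."""
    y, sr = librosa.load(wav_path, sr=sr)                        # time-domain signal at 44.1 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    lms = librosa.power_to_db(mel)                               # convert amplitude to the log scale
    return lms.astype(np.float32)                                # shape: (n_mels, n_frames)

# A 0.5 s chunk at 44.1 kHz with hop 512 yields roughly 44 frames, i.e. a 128x44 LMS.
```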
3.4 Image Encoder Network
For encoding images, we train a CNN with six convolutional layers and three fully connected layers (see Fig. 5). All the convolution kernels are of size 3x3. The last layer is used for classification with a softmax loss. This CNN image classifier achieves a high accuracy of more than 95 percent on the testing set. After the network is trained, its last layer is removed, and the feature vector of the second-to-last layer, having size 128, is used as the image encoding in our GAN network.

Figure 5: Image classifier trained with instrument category loss.
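For reference, the layer list shown in Fig. 5 (paired 3x3 convolution blocks with 16, 32, and 64 filters, 2x2 max pooling between blocks, and fully connected layers of sizes 1024, 128, and 13) can be written as the following PyTorch sketch. The input resolution, the placement of ReLU and pooling, and the omission of the l2 penalties are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ImageClassifier(nn.Module):
    """Image -> 13 instrument classes; the 128-d penultimate feature is the image encoding."""
    def __init__(self, n_classes=13):
        super().__init__()
        def block(c_in, c_out):                          # two 3x3 convs + 2x2 max pooling (Fig. 5)
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2))
        self.features = nn.Sequential(block(3, 16), block(16, 32), block(32, 64))
        self.fc1 = nn.Sequential(nn.Flatten(), nn.LazyLinear(1024), nn.ReLU(), nn.Dropout(0.7))
        self.fc2 = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(), nn.Dropout(0.8))
        self.out = nn.Linear(128, n_classes)

    def forward(self, img, return_encoding=False):
        h = self.fc2(self.fc1(self.features(img)))       # 128-d image encoding (later compressed to phi(I))
        return h if return_encoding else self.out(h)

# Usage: enc = ImageClassifier()(torch.randn(4, 3, 64, 64), return_encoding=True)  # (4, 128)
```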
4 DATASETS
To the best of our knowledge, there is no existing dataset that we can directly work on. Therefore, we compose two novel datasets to train and evaluate our models: a Subset of URMP (Sub-URMP) dataset and an ImageNet Image-Sound (INIS) dataset.
The Sub-URMP dataset is composed from the original URMP dataset [11]. It contains 13 music instrument categories. In each category, there are recorded videos of 1 to 5 persons playing different music pieces (see Fig. 6). We separate videos into 80% for training and 20% for testing, and ensure that a video will not appear in both the training and testing sets. We segment the videos into small chunks with a 0.5 second duration. We use the first frame in each chunk to represent the matching image of the audio. We calculate the loudness (Γ, unit: dBFS) for all audio chunks using the formula Γ = 20 * log10(|ψ|/max(ψ)), where ψ is the matrix obtained after loading the wave file into a numpy array. We set a threshold (Θ = -45 dBFS) and delete chunks having Γ ≤ Θ. Finally, there are a total of 17,555 sound-image pairs in our composed Sub-URMP dataset. The basic information is shown in Table 2. We use this dataset as our main dataset to evaluate models in Sec. 5.
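The loudness filter above can be sketched in a few lines of numpy. Since the paper's formula is element-wise over the array ψ, the reduction to a single dBFS value per chunk (here, the mean of the element-wise values) is an assumption made for this illustration.

```python
import numpy as np

THRESHOLD_DBFS = -45.0   # threshold Θ from Sec. 4

def chunk_loudness(psi: np.ndarray) -> float:
    """Approximate loudness Γ (dBFS) of one 0.5 s audio chunk ψ."""
    mag = np.abs(psi) + 1e-12                  # avoid log(0) on silent samples
    gamma = 20.0 * np.log10(mag / mag.max())   # element-wise 20*log10(|ψ|/max(ψ))
    return float(gamma.mean())                 # assumed reduction to one value per chunk

def keep_chunk(psi: np.ndarray) -> bool:
    """Keep a chunk only if it is louder than the threshold, i.e. Γ > Θ."""
    return chunk_loudness(psi) > THRESHOLD_DBFS
```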
Figure 6: Examples in the Sub-URMP dataset. Each category contains roughly 6 different complete solo songs.

    Category      Cello  Double Bass  Oboe  Sax   Trumpet  Viola  Bassoon  Clarinet  Horn  Flute  Trombone  Tuba  Violin
    Training set  1619   448          626   1203  1138     1708   138      1308      774   1820   1433      855   1263
    Testing set   289    245          465   217   285      177    260      337       145   327    278       136   341

Table 2: Number of image-sound pairs in the Sub-URMP dataset.

All images in the INIS dataset are collected from ImageNet, as shown in Fig. 7. There are five categories, and each contains roughly 1200 images. In order to eliminate noise, all images are screened manually. Audio files in this dataset come from a total of 77 solo performances downloaded from the Internet, such as a piano performance of the Moonlight Sonata and a violin performance of the Preludio. We sample 7200 small audio chunks from all songs, each with a 0.5 second duration. We match the audio chunks to the instrument images to create manual sound-image pairs. Table 3 shows the statistics of this dataset.

Figure 7: Examples from the INIS dataset. The bottom row contains images generated by our S2I-A model. Due to the large variation, the images are not as good as those generated on the Sub-URMP dataset.

    Category        Piano  Saxophone  Violin  Drum  Guitar
    Complete songs  23     7          21      7     19
    Training set    766    1171       631     1075  818
    Testing set     327    500        269     460   349

Table 3: Distribution of image-sound pairs in the INIS dataset.

5 EXPERIMENTS
We first introduce our model variations in Sec. 5.1. Then we present our evaluation on instrument-oriented Sound-to-Image (S2I) generation in Sec. 5.2, pose-oriented S2I generation in Sec. 5.3 and Image-to-Sound (I2S) generation in Sec. 5.4.

5.1 Model Variations
We have three variations for our sound-to-image network.
S2I-C network. This is our main sound-to-image network that uses classification-based sound encoding. The model is described in Sec. 3.
S2I-N network. This model is a variation of the S2I-C network. It uses the same sound encoding, but it is trained without the mismatch information S_w (see Eq. 5).
S2I-A network. This model is a variation of the S2I-C network and differs in that it uses autoencoder-based sound encoding. Here, we use a stacked convolution-deconvolution autoencoder to encode sound. We use four stacks. For the first three stacks, we apply convolution and deconvolution, where the output of the convolution is given as input to the next stack. In the last stack, the input (a 2D array of shape 120x36) is flattened and projected to a vector of size 128 via a fully connected layer. The network is trained to minimize the MSE for all stacks in order.
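The S2I-A encoder can be pictured with the following sketch of one convolution-deconvolution stack plus the final flatten-and-project stack. Channel counts, kernel sizes, and strides are not given in the paper, and how the LMS input becomes the stated 120x36 array fed to the last stack is not specified, so the values below are placeholders; only the overall structure (conv-deconv stacks trained with MSE in order, then a flatten to a 128-d code) follows the description.

```python
import torch
import torch.nn as nn

class ConvDeconvStack(nn.Module):
    """One stack: encode with a convolution, reconstruct with a deconvolution (MSE target = input)."""
    def __init__(self, c_in=1, c_out=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())
        self.dec = nn.Conv2d(c_out, c_in, 3, padding=1)

    def forward(self, x):
        code = self.enc(x)
        return code, self.dec(code)          # reconstruction is compared to x with nn.MSELoss()

class FinalStack(nn.Module):
    """Last stack: flatten the 120x36 input and project to a 128-d sound code."""
    def __init__(self, in_shape=(120, 36), code_dim=128):
        super().__init__()
        n = in_shape[0] * in_shape[1]
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(n, code_dim))
        self.dec = nn.Linear(code_dim, n)     # reconstructs the flattened input for the MSE loss

    def forward(self, x):
        code = self.enc(x)
        return code, self.dec(code).view(x.size(0), 1, *x.shape[-2:])
```

Training proceeds stack by stack: each stack is fit to reconstruct its own input before the next stack is trained on its code, matching the "MSE for all stacks in order" description.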
    Score  Meaning
    3      Realistic image & matching instrument
    2      Realistic image & mismatched instrument
    1      Fair image (player visible, instrument not visible)
    0      Unrealistic image

Table 4: Scoring guideline of the human evaluation.

5.2 Evaluating Instrument-Oriented S2I Generation
We show qualitative examples in Fig. 1 for S2I generation. It can be seen that the quality of the images generated by S2I-C is better than that of its variations. This is because the classifier is explicitly trained to classify the instruments in sound. Therefore, when this encoding is given as a condition to the generator network, it faces less ambiguity in deciding what to generate. Furthermore, while training the classifier, we observe the classification accuracy, which is a direct measurement of how discriminative the encoding is. This is not true in the case of the autoencoder. We know the loss function value, but we do not know if it is a good condition feature in our conditional GANs.

5.2.1 Human Evaluation. We have human subjects evaluate our sound-to-image generation. They are given 10 sets of images for each instrument. Each set contains four images: one generated by each of S2I-C, S2I-N and S2I-A, and a ground-truth image to calibrate the scores. Human subjects are well-informed about the music instrument category of the image sets. However, they are not aware of the mapping between images and methods. They are asked to score the images on a scale of 0 to 3, where the meaning of each score is given in Table 4.
Figure 8 shows the results of the human evaluation. More than half of all images generated by S2I-C are considered realistic by our human subjects, i.e., getting a score of 2 or 3. One third of them get score 3. This is much higher than S2I-N and S2I-A. In terms of mean score, S2I-C gets 1.81, whereas the ground truth gets 2.59 due to the small size; all images are evaluated at size 64x64.
Images from three instruments in particular were rated very highly among all images generated by S2I-C. Out of 30 Cello images, 18 received the highest score of 3, while 25 received scores of 2 or above. Cello images received an average score of 1.9.
Out of 30 Flute images, 15 received the highest possible score of 3, while 24 received a score of 2 or above. Flute images also received an average score of 2.1. Out of 30 Double-Bass images, 18 received score 3, while 21 got a score of 2 or more. The average score of the Double-Bass images was 2.02.

Figure 8: Result of the human evaluation on generated images. The upper right shows the average scores of the S2I GANs in the human evaluation.

5.2.2 Classification Evaluation. We use the classifier used for encoding images (see Fig. 5) to evaluate our generated images. When classifying real images, the accuracy of the classifier is above 95%, thus we decide to use this classifier (Γ) to verify whether the generated (fake) images are classifiable and whether they belong to the expected instrument categories. We calculate the accuracies on images generated by S2I-C, S2I-A and S2I-N. Table 5 shows the results. It shows that the accuracy of S2I-A and S2I-N is far worse than the accuracy of S2I-C.

    Mode          S2I-C    S2I-A    S2I-N
    Training set  87.37%   10.63%   12.62%
    Testing set   75.56%   10.95%   12.32%

Table 5: Classifier-based evaluation accuracy for images.

5.2.3 Evolution of Classification Accuracy. Figure 9 shows the classification accuracy on images generated on both the training set and the testing set. It is plotted for every fifth epoch. The model used for plotting this figure is our main S2I-C network. We visualize generated images for a few key moments in the figure. It shows that the accuracy increases rapidly up until the 35th epoch, and then begins to fall sharply until the 50th epoch, after which it picks up a little again, although the accuracy is still much lower than the peak accuracy. The training and testing accuracies follow nearly the same trend.

Figure 9: Evolution of image quality and classification accuracy on generated images versus the number of epochs. Accuracy (a) is the rate at which fake images generated on the training set of S2I-C are classified into the right category by classifier Γ. Accuracy (b) is the rate at which fake images generated on the testing set of S2I-C are classified into the right category by classifier Γ.

One potential reason is that the discriminator loses both its classification power and its power to tell fake images apart around epoch 50. Thereafter, it recovers the ability to tell fake images, although not its discriminative power: the slightly higher accuracy is a result of generating the same image with minor variations for all the input audios, so at least some are classified correctly. This can be seen in the attached images. At epoch 50 we have a totally random-looking image, while at epoch 60 we can see that the Cello image looks like the Cello image in the dataset, while the other image, which was supposed to be Flute, looks like the Clarinet image from the dataset. Thus, while the images look like images from the dataset, they are not classified correctly. Hence we get a higher accuracy than for bad images, but still not as high as for correctly classified, high-quality images.
It is interesting to note that even the fifth epoch has much higher training and testing accuracies than any epoch after 40. What this means is that, even after as few as 5 epochs, not only are the images getting aligned with the expected category, but the generated images also have enough quality that a classifier can extract distinguishing features from them. This is not true in the case of a random image like the ones in epoch 50.

5.3 Evaluating Pose-Oriented S2I Generation
The model and the training strategy for our pose-oriented S2I generation are described in Sec. 3.2. The results we got were encouraging: various poses can be observed in the generated images (see Fig. 10). Note that for sound encoding, we used the same classifier as in S2I-C. It is trained to classify various instruments, not various poses. With a classifier that is trained to classify music notes, we expect the results to better match the expected poses.

5.4 Evaluating I2S Generation
When converting LMS back into waveform files, we lose the high-frequency part, as the mel filtering is not invertible. Therefore, we conduct our evaluation on the generated sound spectrograms.
We use the sound classifier (see Fig. 4), which is trained to encode sound for image generation. The reason we use this model is that it is trained on real LMS and achieves a high accuracy of 80% on the testing set of real LMS. We achieve 11.17% classification accuracy on the generated LMS. One factor that might be affecting the accuracy is that we generate spectrograms of size 128x34 and resize them to 128x44 for classification. Furthermore, Figure 11 shows the generated LMS compared to the real LMS. We can see that, in the fake LMS, there is less energy in the high-frequency range and more energy in the low-frequency range, the same as in the real LMS.

Figure 10: Generated pose images. The first row shows playing viola; the head position corresponds to the arm movement. The second and third rows are violin, indicating that one single model can generate multiple persons in different poses. The fourth row is cello, where the whole movement range is very large. Within a single category, songs in the training set and testing set are collected from different videos.

Figure 11: Generated sound spectrograms and ground truth.

6 CONCLUSION
In this paper, we introduce the problem of cross-modal audio-visual generation and make the first attempt to use conditional GANs on intersensory generation. In order to evaluate our models, we compose two novel datasets, i.e., Sub-URMP and INIS. Our experiments demonstrate that our model can, indeed, generate one modality (visual/audio) from the other modality (audio/visual) to a good extent at both the instrument-level and the pose-level. For example, our model is able to generate the pose of a cello player given the note that is being played.
Limitation and Future Work. While our I2S model generates LMS, the accuracy is low. Furthermore, it would be worthwhile to hire experts to listen to audio waveform files reconstructed from the generated LMS spectrograms and compare them against the ground truth. On the other hand, we are able to generate various poses using our S2I network, but it is hard to quantify how good the generation is. Strengthening the autoencoder would enable accurate unsupervised generation. The present autoencoder appears to be limited in terms of extracting good representations. It is our future work to explore all these directions.

REFERENCES
[1] Sima Behpour and Brian D Ziebart. 2016. Adversarial methods improve object localization. In Advances in Neural Information Processing Systems Workshop.
[2] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. 2016. An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild. In European Conference on Computer Vision.
[3] Richard K Davenport, Charles M Rogers, and I Steele Russell. 1973. Cross modal perception in apes. Neuropsychologia 11, 1 (1973), 21–28.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition.
[5] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems.
[6] Fangxiang Feng, Xiaojie Wang, and Ruifan Li. 2014. Cross-modal retrieval with correspondence autoencoder. In ACM International Conference on Multimedia.
[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems.
[8] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2016. Image-to-image translation with conditional adversarial networks. Technical Report. arXiv:1611.07004.
[9] S. Kumar, V. Dhiman, and J. J. Corso. 2014. Learning compositional sparse models of bimodal percepts. In AAAI Conference on Artificial Intelligence.
[10] Bochen Li, Karthik Dinesh, Zhiyao Duan, and Gaurav Sharma. 2017. See and listen: score-informed association of sound tracks to players in chamber music performance videos. In IEEE International Conference on Acoustics, Speech and Signal Processing.
[11] Bochen Li, Xinzhao Liu, Karthik Dinesh, Zhiyao Duan, and Gaurav Sharma. 2016. Creating A Musical Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications. arXiv:1612.08727.
[12] Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. 2016. Semantic Segmentation using Adversarial Networks. Technical Report. arXiv:1611.08408.
[13] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2016. Adversarial autoencoders. In International Conference on Learning Representations.
[14] Christophe Mignot, Claude Valot, and Noelle Carbonell. 1993. An experimental study of future "natural" multimodal human-computer interaction. In INTERACT'93 and CHI'93 Conference Companion on Human Factors in Computing Systems.
[15] Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. Technical Report. arXiv:1411.1784.
[16] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep learning. In International Conference on Machine Learning.
[17] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman. 2016. Visually indicated sounds. In IEEE Conference on Computer Vision and Pattern Recognition.
[18] Santiago Pascual, Antonio Bonafonte, and Joan Serrà. 2017. SEGAN: Speech Enhancement Generative Adversarial Network. Technical Report. arXiv:1703.09452.
[19] Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Nikhil Rasiwasia, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. 2014. On the role of correlation and abstraction in cross-modal multimedia retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 3 (2014), 521–535.
[20] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations.
[21] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. 2010. A new approach to cross-modal multimedia retrieval. In ACM International Conference on Multimedia.
[22] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. 2016. Learning deep representations of fine-grained visual descriptions. In IEEE Conference on Computer Vision and Pattern Recognition.
[23] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning.
[24] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. In Advances in Neural Information Processing Systems.
[25] Nasim Souly, Concetto Spampinato, and Mubarak Shah. 2017. Semi and Weakly Supervised Semantic Segmentation Using Generative Adversarial Network. Technical Report. arXiv:1703.09695.
[26] Nitish Srivastava and Ruslan R Salakhutdinov. 2012. Multimodal learning with deep boltzmann machines. In Advances in Neural Information Processing Systems.
[27] Barry E Stein and M Alex Meredith. 1993. The merging of the senses. The MIT Press.
[28] Russell L Storms. 1998. Auditory-visual cross-modal perception phenomena. Ph.D. Dissertation. Naval Postgraduate School.
[29] M Iftekhar Tanveer, Ji Liu, and M Ehsan Hoque. 2015. Unsupervised extraction of human-interpretable nonverbal behavioral cues in a public speaking scenario. In ACM International Conference on Multimedia.
[30] Bradley W Vines, Carol L Krumhansl, Marcelo M Wanderley, and Daniel J Levitin. 2006. Cross-modal interactions in the perception of musical performance. Cognition 101, 1 (2006), 80–113.
[31] Jean Vroomen and Beatrice de Gelder. 2000. Sound enhances visual perception: cross-modal effects of auditory organization on vision. Journal of experimental psychology: Human perception and performance 26, 5 (2000), 1583.
[32] Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. 2016. A Comprehensive Survey on Cross-modal Retrieval. Technical Report. arXiv:1607.06215.
[33] Hang Zhang and Kristin Dana. 2017. Multi-style Generative Network for Real-time Transfer. arXiv:1703.06953.
