
S2IGAN: Speech-to-Image Generation via Adversarial Learning

Xinsheng Wang1,2, Tingting Qiao2,3, Jihua Zhu1, Alan Hanjalic2, Odette Scharenborg2

1 School of Software Engineering, Xi'an Jiaotong University, China.
2 Multimedia Computing Group, Delft University of Technology, Delft, The Netherlands.
3 College of Computer Science and Technology, Zhejiang University, China.
wangxinsheng@stu.xjtu.edu.cn, qiaott@zju.edu.cn, zhujh@xjtu.edu.cn, a.hanjalic@tudelft.nl,
o.e.scharenborg@tudelft.nl

Abstract

An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions into photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model (RDG). SEN learns the speech embedding under the supervision of the corresponding visual information. Conditioned on the speech embedding produced by SEN, the proposed RDG synthesizes images that are semantically consistent with the corresponding speech descriptions. Extensive experiments on the CUB and Oxford-102 datasets demonstrate the effectiveness of the proposed S2IGAN in synthesizing high-quality and semantically consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.

Index Terms: Speech-to-image generation, multimodal modelling, speech embedding, adversarial learning.

1. Introduction

The recent development of deep learning and Generative Adversarial Networks (GANs) [1, 2, 3] has led to many efforts on the task of image generation conditioned on natural language [4, 5, 6, 7, 8, 9]. Although great progress has been made, most existing natural language-to-image generation systems use text descriptions as their input, a task also referred to as Text-to-Image Generation (T2IG). Recently, a speech-based task was proposed in which face images are synthesized conditioned on speech [10, 11]. This task, however, only considers the acoustic properties of the speech signal, not the language content. Here, we present a natural language-to-image generation system that is based on a spoken description, bypassing the need for text. We refer to this new task as Speech-to-Image Generation (S2IG). It is similar to the recently proposed task of speech-to-image translation [12].

This work is motivated by the fact that an estimated half of the 7,000 languages in the world do not have a written form [13] (so-called unwritten languages), which makes it impossible for these languages to benefit from any existing text-based technology, including text-to-image generation. The Linguistic Rights included in the Universal Declaration of Human Rights state that it is a human right to communicate in one's native language. For these unwritten languages, it is therefore essential to develop a system that bypasses text and maps speech descriptions directly to images. Moreover, even though existing knowledge and methodology make a 'speech2text2image' pipeline possible, directly mapping speech to images might be more efficient and straightforward.

In order to synthesize plausible images based on speech descriptions, speech embeddings that carry the details of the semantic information in the image need to be learned. To that end, we decompose the task of S2IG into two stages, i.e., a speech semantic embedding stage and an image generation stage. Specifically, the proposed speech-to-image generation model via adversarial learning (which we refer to as S2IGAN) consists of a Speech Embedding Network (SEN), which is trained to obtain speech embeddings by modeling and co-embedding speech and images together, and a novel Relation-supervised Densely-stacked Generative Model (RDG), which takes random noise and the speech embedding produced by SEN as input and synthesizes photo-realistic images in a multi-step (coarse-to-fine) way.

In this paper, we present our attempt to generate images directly from the speech signal, bypassing text. This task requires specific training material consisting of speech and image pairs. Unfortunately, no such database, with the right amount of data, exists for an unwritten language. The results of our proof-of-concept are therefore presented on two databases with English descriptions, i.e., CUB [14] and Oxford-102 [15]. The benefit of using English as our working language is that we can compare our S2IG results to T2IG results in the literature. Our results are also compared to those of [12].

2. Approach

Given a speech description, our goal is to generate an image that is semantically aligned with the input speech. To this end, S2IGAN consists of two modules, i.e., SEN to create the speech embeddings and RDG to synthesize the images using these speech embeddings.

2.1. Datasets

CUB [14] and Oxford-102 [15] are two commonly used datasets in the field of T2IG [4, 5], and were also adopted in the most recent S2IG work [12]. CUB is a fine-grained bird dataset that contains 11,788 bird images belonging to 200 categories, and Oxford-102 is a fine-grained flower dataset that contains 8,189 images of flowers from 102 different categories. Each image in both datasets has 10 text descriptions collected by [16]. Since no speech descriptions are available for either dataset, we generated speech from the text descriptions using Tacotron 2 [17], a text-to-speech system1.

1 https://github.com/NVIDIA/tacotron2
Figure 1: Framework of the relation-supervised densely-stacked generative model (RDG). Î_2^RI represents a real image from the same class as the ground-truth image (Î_2^GT), I_2 represents a fake image synthesized by the framework, and Î_2^MI represents a real image from a different class than Î_2^GT. L_i indicates the labels of the three types of relations. SED and IED are pre-trained in SEN. (The figure shows the densely-stacked generators G_0–G_2 with discriminators D_0–D_2, the conditioning augmentation F_ca with noise z ~ N(0,1), and the relation supervisor with its relation classifier.)

2.2. Speech Embedding Network (SEN)

Given an image-speech pair, SEN tries to find a common space for both modalities, so that we can minimize the modality gap and obtain visually grounded speech embeddings. SEN is a dual-encoder framework, consisting of an image encoder and a speech encoder, and is similar to the model structure in [18].

The image encoder (IED) adopts Inception-v3 [19], pre-trained on ImageNet [20], to extract visual features. On top of it, a single linear layer is employed to map the visual features into the common space of visual and speech embeddings. As a result, we obtain an image embedding V from IED.

The speech encoder (SED) employs a structure similar to that of [18]. Specifically, it consists of a two-layer 1-D convolution block, two bi-directional gated recurrent unit (GRU) [21] layers and a self-attention layer. Finally, the speech is represented by a speech embedding A in the common space. The input to the SED is a log Mel filter bank spectrogram, obtained from the speech signal using 40 Mel-spaced filter banks with a 25 ms Hamming window and a 10 ms shift.

More details of SEN, including an illustration of the framework, can be found on the project website2.

2 For more details on the model and results, please see: https://xinshengwang.github.io/project/s2igan/
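To make the dual-encoder structure concrete, the following is a minimal PyTorch sketch of such a speech/image encoder pair. Only the overall structure follows the description above (a linear projection on top of Inception-v3 features, and a two-layer 1-D convolution block followed by two bi-directional GRU layers and self-attention pooling on the speech side); the layer widths, kernel sizes and the common embedding dimension are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """IED sketch: a single linear layer on top of pre-extracted
    Inception-v3 features, mapping them into the common space."""
    def __init__(self, feat_dim=2048, embed_dim=512):    # embed_dim is an assumption
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)

    def forward(self, inception_feats):                   # (batch, 2048)
        return self.fc(inception_feats)                   # image embedding V

class SpeechEncoder(nn.Module):
    """SED sketch: a two-layer 1-D convolution block, two bi-directional GRU
    layers and self-attention pooling over a log-Mel spectrogram
    (40 Mel bands, 25 ms Hamming window, 10 ms shift)."""
    def __init__(self, n_mels=40, conv_dim=64, hidden=256, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, conv_dim, kernel_size=6, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(conv_dim, conv_dim, kernel_size=6, stride=2, padding=2), nn.ReLU())
        self.gru = nn.GRU(conv_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)               # self-attention scores over time
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, log_mel):                            # (batch, 40, T)
        x = self.conv(log_mel).transpose(1, 2)             # (batch, T', conv_dim)
        x, _ = self.gru(x)                                 # (batch, T', 2*hidden)
        w = torch.softmax(self.attn(x), dim=1)             # attention weights
        pooled = (w * x).sum(dim=1)                        # weighted sum over time
        return self.proj(pooled)                           # speech embedding A
```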
2.2.1. Objective Function

To minimize the distance between a matched pair of an image feature and a speech feature, while keeping the features discriminative with respect to features from other bird (CUB) or flower (Oxford-102) classes, a matching loss and a distinctive loss are proposed.

Matching loss is designed to minimize the distance of a matched image-speech pair. Specifically, in a batch of image-speech embedding pairs {(V_i, A_i)}_{i=1}^{n}, where n is the batch size, the probability of the speech embedding A_i matching the image embedding V_i is

  P(V_i | A_i) = \frac{\exp(\beta S(A_i, V_i))}{\sum_{j=1}^{n} M_{i,j} \exp(\beta S(A_i, V_j))},    (1)

where β is a smoothing factor, set to 10 following [6], and S(A_i, V_i) is the cosine similarity score of A_i and V_i. As in a mini-batch we only treat (V_i, A_i) as a positive matched pair, we use a mask M ∈ R^{n×n} to deactivate the effect of pairs from the same class. Specifically,

  M_{i,j} = \begin{cases} 0, & \text{if } A_i \text{ matches } V_j \text{ and } i \neq j, \\ 1, & \text{otherwise}, \end{cases}    (2)

where "A_i matches V_j" means that they come from the same class. The loss function is then defined as the negative log probability of P(V_i | A_i):

  L_{A-V} = -\sum_{i=1}^{n} \log P(V_i | A_i).    (3)

Reversely, we also calculate L_{V-A} for V_i matching A_i. The matching loss is then calculated as

  L_m = L_{A-V} + L_{V-A}.    (4)
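A possible implementation of the bidirectional matching loss of Eqs. (1)-(4) is sketched below; it assumes L2-normalized embeddings, per-sample class labels and β = 10, and is an illustration rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def matching_loss(A, V, labels, beta=10.0):
    """Bidirectional matching loss of Eqs. (1)-(4).
    A, V: (n, d) speech / image embeddings; labels: (n,) class ids."""
    A = F.normalize(A, dim=1)
    V = F.normalize(V, dim=1)
    sim = beta * A @ V.t()                                 # beta * S(A_i, V_j)

    # Mask M (Eq. 2): zero out off-diagonal pairs that come from the same class.
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=A.device)
    M = (~(same_class & ~eye)).float()

    exp_sim = torch.exp(sim) * M
    # Diagonal over masked row sums gives P(V_i | A_i) of Eq. (1).
    p_v_given_a = exp_sim.diagonal() / exp_sim.sum(dim=1)
    loss_av = -torch.log(p_v_given_a + 1e-8).sum()         # Eq. (3)
    # Reverse direction, P(A_i | V_i): normalize over the other axis.
    p_a_given_v = exp_sim.diagonal() / exp_sim.sum(dim=0)
    loss_va = -torch.log(p_a_given_v + 1e-8).sum()
    return loss_av + loss_va                               # Eq. (4)
```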
Distinctive loss is designed to ensure that the embedding space is optimally discriminative with respect to the instance classes. Specifically, both the speech and the image features in the embedding space are converted to a label space by adding a perception layer, i.e., V̂_i = f(V_i) and Â_i = f(A_i), where V̂_i, Â_i ∈ R^N and N is the number of classes. The loss function is given by

  L_d = -\sum_{i=1}^{n} \left( \log \hat{P}(C_i | \hat{A}_i) + \log \hat{P}(C_i | \hat{V}_i) \right),    (5)

where P̂(C_i | Â_i) and P̂(C_i | V̂_i) represent the softmax probabilities of Â_i and V̂_i belonging to their corresponding class C_i.

The total loss for training SEN is finally given by

  L_{SEN} = L_m + L_d.    (6)
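The distinctive loss of Eq. (5) amounts to a cross-entropy over a classification ("perception") layer applied to both embeddings; a minimal sketch, assuming a single shared linear layer f and integer class labels:

```python
import torch.nn as nn
import torch.nn.functional as F

class DistinctiveLoss(nn.Module):
    """Eq. (5): classify speech (A) and image (V) embeddings into the N classes."""
    def __init__(self, embed_dim=512, num_classes=200):       # e.g. 200 classes for CUB
        super().__init__()
        self.f = nn.Linear(embed_dim, num_classes)             # the perception layer

    def forward(self, A, V, class_ids):                        # class_ids: (n,) long tensor
        # cross_entropy is the negative log softmax probability of the true class C_i
        return (F.cross_entropy(self.f(A), class_ids, reduction='sum')
                + F.cross_entropy(self.f(V), class_ids, reduction='sum'))
```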
2.3. Relation-supervised Densely-stacked Generative Model (RDG)

After learning the visually grounded and class-discriminative speech embeddings, we employ RDG to generate images conditioned on these speech embeddings. RDG consists of two sub-modules, a Densely-stacked Generator (DG) and a Relation Supervisor (RS); see Figure 1.

2.3.1. Densely-stacked Generator (DG)

RDG uses the multi-step generation structure [5, 7, 8] because of its previously shown performance. This structure generates images from small scale (low resolution) to large scale (high resolution) step by step. Specifically, in our model, 64×64, 128×128, and 256×256 pixel images are generated in successive steps. To fully exploit the information in the hidden feature h_i of each step, we design a densely-stacked generator. With the speech embedding A as input, the generated image of each stacked generator can be expressed as follows:

  h_0 = F_0(z, F^{ca}(A)),
  h_i = F_i(h_0, \ldots, h_{i-1}, F^{ca}(A)), \quad i \in \{1, 2\},    (7)
  I_i = G_i(h_i), \quad i \in \{0, 1, 2\},
where z is a noise vector sampled from a normal distribution and F^{ca} represents Conditioning Augmentation [22, 5], which augments the speech features and thus produces more image-speech pairs; it is a popular and useful strategy used in most recent text-to-image generation work [9, 6, 7]. h_i is the hidden feature produced by the non-linear transformation F_i, and h_i is fed to the generator G_i to obtain the image I_i.
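The dense stacking of Eq. (7) can be sketched as follows; F_i and G_i are stand-ins for the actual upsampling and convolutional blocks (reduced here to toy linear layers), so the snippet only illustrates how every stage receives all earlier hidden features together with the conditioning-augmented speech embedding.

```python
import torch
import torch.nn as nn

class DenselyStackedGenerator(nn.Module):
    """Sketch of Eq. (7): h_0 = F_0(z, c), h_i = F_i(h_0, ..., h_{i-1}, c),
    I_i = G_i(h_i). The real F_i/G_i are upsampling CNN blocks producing
    64x64, 128x128 and 256x256 images; toy layers are used here."""
    def __init__(self, z_dim=100, c_dim=128, h_dim=32, stages=3):
        super().__init__()
        self.f0 = nn.Linear(z_dim + c_dim, h_dim)                           # F_0
        self.f = nn.ModuleList(nn.Linear(i * h_dim + c_dim, h_dim)          # F_1, F_2
                               for i in range(1, stages))
        self.g = nn.ModuleList(nn.Linear(h_dim, 3) for _ in range(stages))  # "to-image" heads G_i

    def forward(self, z, c):                 # c = F_ca(A), the augmented speech embedding
        hs = [torch.tanh(self.f0(torch.cat([z, c], dim=1)))]                # h_0
        for f in self.f:                                                    # dense stacking
            hs.append(torch.tanh(f(torch.cat(hs + [c], dim=1))))            # h_i
        return [g(h) for g, h in zip(self.g, hs)]                           # I_0, I_1, I_2

# Usage: imgs = DenselyStackedGenerator()(torch.randn(4, 100), torch.randn(4, 128))
```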
2.3.2. Relation Supervisor (RS)

To ensure that the generator produces high-quality images that are semantically aligned with the spoken description, we propose a relation supervisor to impose a strong relation constraint on the generation process. Specifically, we form an image set for each generated image I_i, i.e., {I_i, Î_i^GT, Î_i^RI, Î_i^MI}, denoting the generated fake image, the ground-truth image, a real image from the same class as I_i, and a real image from a different, randomly sampled class, respectively. We then define three types of relation classes: 1) a positive relation L_1, between Î_i^GT and Î_i^RI; 2) a negative relation L_2, between Î_i^GT and Î_i^MI; and 3) an undesired relation L_3, between Î_i^GT and Î_i^GT (i.e., an image paired with itself). A relation classifier is trained to classify these three relations. We expect the relation between I_i and Î_i^GT to be close to the positive relation L_1, because I_i should be semantically aligned with its corresponding Î_i^GT; however, it should not be identical to Î_i^GT, in order to ensure the diversity of the generated results. Therefore, the loss function for training the RS is defined as:

  L_{RS} = -\sum_{j=1}^{3} \log \hat{P}(L_j | R_j) - \log \hat{P}(L_1 | R_{GT-FI}),    (8)

where R_j is the relation vector produced by RS for a pair of images with relation L_j, e.g., R_1 = RS(Î^GT, Î^RI), and R_{GT-FI} is the relation vector between Î_i^GT and I_i. Note that we apply RS only to the last generated image, i.e., i = 2, for computational efficiency.
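A hedged sketch of how the relation supervisor and the loss of Eq. (8) could look: a small classifier scores the relation between a pair of image feature vectors, the three reference relations are supervised with their labels, and the (ground-truth, generated) pair is pushed toward the positive relation L_1. The feature extractor and dimensions are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

POSITIVE, NEGATIVE, UNDESIRED = 0, 1, 2      # relation labels L1, L2, L3

class RelationClassifier(nn.Module):
    """Produces a relation vector R(a, b) for a pair of image features (toy MLP)."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, a, b):
        return self.net(torch.cat([a, b], dim=1))

def relation_loss(rs, f_gt, f_ri, f_mi, f_fake):
    """Eq. (8): classify the three reference relations, and in addition push the
    relation between the ground truth and the generated image toward L1."""
    n = f_gt.size(0)
    pos = torch.full((n,), POSITIVE, dtype=torch.long, device=f_gt.device)
    neg = torch.full((n,), NEGATIVE, dtype=torch.long, device=f_gt.device)
    und = torch.full((n,), UNDESIRED, dtype=torch.long, device=f_gt.device)
    return (F.cross_entropy(rs(f_gt, f_ri), pos, reduction='sum')       # L1: same class
            + F.cross_entropy(rs(f_gt, f_mi), neg, reduction='sum')     # L2: different class
            + F.cross_entropy(rs(f_gt, f_gt), und, reduction='sum')     # L3: image with itself
            + F.cross_entropy(rs(f_gt, f_fake), pos, reduction='sum'))  # R_{GT-FI} -> L1
```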
2.3.3. Objective Function

The final objective function of RDG is defined as:

  L_G = \sum_{i=0}^{2} L_{G_i} + L_{RS},    (9)

where the loss function for the i-th generator G_i is defined as:

  L_{G_i} = -\mathbb{E}_{I_i \sim p_{G_i}}[\log D_i(I_i)] - \mathbb{E}_{I_i \sim p_{G_i}}[\log D_i(I_i, F^{ca}(A))].    (10)

The loss function for the corresponding discriminators of RDG is given by:

  L_D = \sum_{i=0}^{2} L_{D_i},    (11)

where the loss function for the i-th discriminator D_i is given by:

  L_{D_i} = -\mathbb{E}_{\hat{I}_i \sim p_{data_i}}[\log D_i(\hat{I}_i)] - \mathbb{E}_{I_i \sim p_{G_i}}[\log(1 - D_i(I_i))]
            - \mathbb{E}_{\hat{I}_i \sim p_{data_i}}[\log D_i(\hat{I}_i, F^{ca}(A))] - \mathbb{E}_{I_i \sim p_{G_i}}[\log(1 - D_i(I_i, F^{ca}(A)))].    (12)

Here, the first two terms are the unconditional loss that discriminates between fake and real images, and the last two terms are the conditional loss that discriminates whether the image and the speech description match or not. I_i is drawn from the model distribution p_{G_i} at the i-th scale, and Î_i is drawn from the real image distribution p_{data_i} at the same scale. The generators and discriminators were trained alternately.
2.4. Evaluation Metrics

We use two kinds of metrics to evaluate the performance of our S2IGAN model. To evaluate the diversity and quality of the generated images, we used two popular evaluation metrics for the quantitative evaluation of generative models, as in [5]: the Inception score (IS) [23] and the Fréchet inception distance (FID) [24]. A higher IS means more diversity, and a lower FID means a smaller distance between the generated and real image distributions, which indicates better generated images.

The visual-semantic consistency between the generated images and their speech descriptions is evaluated through a content-based image retrieval experiment between the real and the generated images, measured with mAP scores. Specifically, we randomly chose two real images from each class of the test set, resulting in a query pool. We then used these query images to retrieve generated (fake) images belonging to their corresponding classes. We used the pre-trained Inception-v3 to extract the features of all images. A higher mAP indicates a closer feature distance between the fake images and their ground-truth images, which indirectly indicates a higher semantic consistency between the generated images and their corresponding speech descriptions.
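For illustration, the retrieval-based consistency measure could be computed as follows, assuming Inception-v3 features have already been extracted for the query (real) and gallery (generated) images and that class labels are known; this is a sketch of a standard mAP computation, not the authors' evaluation script.

```python
import numpy as np

def mean_average_precision(query_feats, query_labels, gallery_feats, gallery_labels):
    """mAP of retrieving generated images of the correct class with real-image queries.
    Features are L2-normalized so that the dot product is the cosine similarity."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    aps = []
    for feat, label in zip(q, query_labels):
        order = np.argsort(-(g @ feat))                    # gallery sorted by similarity
        hits = gallery_labels[order] == label
        if hits.sum() == 0:
            continue
        ranks = np.flatnonzero(hits) + 1                   # 1-based ranks of correct items
        precision_at_hits = np.arange(1, len(ranks) + 1) / ranks
        aps.append(precision_at_hits.mean())               # average precision for this query
    return float(np.mean(aps))
```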
3. Results

3.1. Objective Results

We compare our results with several state-of-the-art T2IG methods, including StackGAN-v2 [5], MirrorGAN [7] and SEGAN [9]. StackGAN-v2 is a strong baseline for the T2IG task and provides the effective stacked structure that the subsequent methods build on. Both MirrorGAN and SEGAN are based on the stacked structure: MirrorGAN utilizes word-level [6] and sentence-level attention mechanisms and a "text-to-image-to-text" structure for T2IG, and SEGAN also uses word-level attention, with an additional attention regularization and a siamese structure. In order to allow a direct comparison on the S2IG task with StackGAN-v2, we reimplemented StackGAN-v2 and replaced its text embedding with our speech embedding. Moreover, we compare our results to the recently released speech-based model of [12].

Table 1: Performance of S2IGAN compared to other methods. † means that the results are taken from the original paper. The best performance is shown in bold.

                                CUB (Bird)                    Oxford-102 (Flower)
Method         Input      mAP     FID     IS            mAP     FID     IS
StackGAN-v2    text       7.01    20.94   4.02±0.03     9.88    50.38   3.35±0.07
MirrorGAN†     text       —       —       4.56±0.05     —       —       —
SEGAN†         text       —       —       4.67±0.04     —       —       —
[12]†          speech     —       18.37   4.09±0.04     —       54.76   3.23±0.05
StackGAN-v2    speech     8.09    18.94   4.14±0.04     12.18   54.33   3.69±0.08
S2IGAN         speech     9.04    14.50   4.29±0.04     13.40   48.64   3.55±0.04

The results are shown in Table 1. First, our method outperformed [12] on all evaluation metrics and datasets. Compared with the StackGAN-v2 that took our speech embedding as input, our S2IGAN also achieved a higher mAP and a lower FID on both datasets. These results indicate that our method is effective in generating high-quality and semantically consistent images on the basis of spoken descriptions. The comparison of our S2IGAN with three state-of-the-art T2IG methods shows that the S2IGAN method is competitive, and thus establishes a solid new baseline for the S2IG task.

Speech input is generally considered to be more difficult to deal with than text because of its high variability, its longer duration, and the lack of pauses between words. Therefore, S2IG is more challenging than T2IG. However, the comparison of the performance of StackGAN-v2 on the S2IG and T2IG tasks shows that StackGAN-v2 generated better images using the speech embeddings learned by our SEN. Moreover, the StackGAN-v2 based on our learned speech embeddings outperforms [12] on almost all evaluation metrics and datasets, except for a slightly higher FID on the CUB dataset. Note that [12] takes the native StackGAN-v2 as its generator, which means that the only difference between [12] and the speech-based StackGAN-v2 in Table 1 is the speech embedding method. These results confirm that our learned speech embeddings are competitive compared to text input and to the speech embeddings of [12], showing the effectiveness of our SEN module.

3.1.1. Subjective Results

The subjective visual results are shown in Figure 2. As can be seen, the images synthesized by our S2IGAN (d) are photo-realistic and convincing. By comparing the images generated by (d) S2IGAN and (c) StackGAN-v2 conditioned on speech embeddings, we can see that the images generated by S2IGAN are clearer and sharper, showing the effectiveness of the proposed S2IGAN in synthesizing visually high-quality images. The comparison of StackGAN-v2 conditioned on (b) text and (c) speech features embedded by the proposed SEN shows that our learned speech embeddings are competitive with the text features embedded by StackGAN-v2, showing the effectiveness of SEN. More results are shown on the project website2.

Figure 2: Examples of images generated by different methods. For each spoken description (e.g., "A small blue bird with long tail feathers and short beak"), the columns show (a) the ground-truth image, (b) StackGAN-v2 (T2IG), (c) StackGAN-v2 (S2IG), and (d) S2IGAN (S2IG).

To further illustrate S2IGAN's ability to capture subtle semantic differences in the speech descriptions, we generated images conditioned on speech descriptions in which color keywords were changed. As Figure 3 shows, the visual semantics of the generated birds, specifically the colors of the belly and the wings, are consistent with the corresponding semantic information in the spoken descriptions. These visualization results indicate that SEN successfully learned the semantic information in the speech signal, and that our RDG is capable of capturing these semantics and generating discriminative images that are semantically aligned with the input speech.

Figure 3: Generated examples by S2IGAN. The generated images are based on speech descriptions with different color keywords, following the template "A small bird with a color-1 belly and color-2 wings" (color-1: yellow, red, grey; color-2: brown, black, blue).

3.2. Component analysis

An extensive ablation study investigated the effectiveness of the key components of S2IGAN. Specifically, the effects of the densely-stacked structure of DG, of RS, and of SEN were investigated by removing each of these components in turn. Removing any component resulted in a clear decrease in generation performance, showing the effectiveness of each component. Details can be found on the project website2.

4. Discussion and Conclusion

This paper introduced a novel speech-to-image generation (S2IG) task, for which we developed a novel generative model, called S2IGAN, which tackles S2IG in two steps. First, semantically discriminative speech embeddings are learned by a speech embedding network. Second, high-quality images are generated on the basis of these speech embeddings. The results of extensive experiments show that our S2IGAN achieves state-of-the-art performance and that the learned speech embeddings capture the semantic information in the speech signal.

The current work is based on synthesized speech, which makes the current S2IG baseline an upper-bound baseline. Future work will focus on several directions. First, we will investigate this task with natural speech instead of synthesized speech. Second, it will be highly interesting to test the proposed methodology on a truly unwritten language rather than on the well-resourced English language. Third, we will further improve our methods in terms of efficiency and accuracy, for example by making end-to-end training more effective and efficient and by applying attention mechanisms to our generator to further improve the quality of the generated images. An interesting avenue for future research would be to automatically discover speech units, on the basis of the corresponding visual information, from the speech signal [25] in order to segment the speech signal. This would allow us to use segment- and word-level attention mechanisms, which have been shown to lead to improved performance on the text-to-image generation task [6], to improve the performance of speech-to-image generation.

5. Acknowledgements

This work has been partially supported by the China Scholarship Council (CSC).
6. References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[2] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.

[3] Y. Balaji, M. R. Min, B. Bai, R. Chellappa, and H. P. Graf, "Conditional GAN with discriminative filter generation for text-to-video synthesis," in Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2019, pp. 1995–2001.

[4] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," arXiv preprint arXiv:1605.05396, 2016.

[5] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1947–1962, 2018.

[6] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1316–1324.

[7] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning text-to-image generation by redescription," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1505–1514.

[8] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, "Semantics disentangling for text-to-image generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2327–2336.

[9] H. Tan, X. Liu, X. Li, Y. Zhang, and B. Yin, "Semantics-enhanced adversarial nets for text-to-image synthesis," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10501–10510.

[10] T.-H. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein, and W. Matusik, "Speech2Face: Learning the face behind a voice," in CVPR, 2019.

[11] Y. Wen, B. Raj, and R. Singh, "Face reconstruction from voice using generative adversarial networks," in NeurIPS, 2019.

[12] J. Li, X. Zhang, C. Jia, J. Xu, L. Zhang, Y. Wang, S. Ma, and W. Gao, "Direct speech-to-image translation," arXiv preprint arXiv:2004.03413, 2020.

[13] M. P. Lewis, G. F. Simons, and C. Fennig, "Ethnologue: Languages of the world, eighteenth edition," Dallas, Texas: SIL International, 2015.

[14] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," 2011.

[15] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.

[16] S. Reed, Z. Akata, H. Lee, and B. Schiele, "Learning deep representations of fine-grained visual descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 49–58.

[17] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.

[18] D. Merkx, S. L. Frank, and M. Ernestus, "Language learning using speech to image retrieval," arXiv preprint arXiv:1909.03795, 2019.

[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.

[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[21] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.

[22] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.

[23] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.

[24] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.

[25] D. Harwath and J. Glass, "Towards visually grounded sub-word speech unit discovery," in ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3017–3021.
