Text-to-Image Generation Using Deep Learning


Proceeding Paper

Text-to-Image Generation Using Deep Learning †


Sadia Ramzan 1, * , Muhammad Munwar Iqbal 1 and Tehmina Kalsum 2

1 Department of Computer Science, University of Engineering and Technology, Taxila 47050, Pakistan;
munwar.iq@uettaxila.edu.pk
2 Department of Software Engineering, University of Engineering and Technology, Taxila 47050, Pakistan;
tehmina.kalsum@uettaxila.edu.pk
* Correspondence: sadia.ramzan@students.uettaxila.edu.pk
† Presented at the 7th International Electrical Engineering Conference, Karachi, Pakistan, 25–26 March 2022.

Abstract: Text-to-image generation is a method used for generating images related to given textual descriptions. It has a significant influence on many research areas as well as a diverse set of applications (e.g., photo-searching, photo-editing, art generation, computer-aided design, image reconstruction, captioning, and portrait drawing). The most challenging task is to consistently produce realistic images according to given conditions. Existing algorithms for text-to-image generation create pictures that do not properly match the text. We considered this issue in our study and built a deep learning-based architecture for semantically consistent image generation: the recurrent convolutional generative adversarial network (RC-GAN). RC-GAN successfully bridges the advancements in text and picture modelling, converting visual notions from words to pixels. The proposed model was trained on the Oxford-102 flowers dataset, and its performance was evaluated using the inception score and PSNR. The experimental results demonstrate that our model is capable of generating more realistic photos of flowers from given captions, with an inception score of 4.15 and a PSNR value of 30.12 dB. In the future, we aim to train the proposed model on multiple datasets.

Keywords: convolutional neural network; recurrent neural network; deep learning; generative
adversarial networks; image generation

1. Introduction

When people listen to or read a narrative, they quickly create pictures in their mind to visualize the content. Many cognitive functions, such as memorization, reasoning ability, and thinking, rely on visual mental imaging or "seeing with the mind's eye" [1]. Developing a technology that recognizes the connection between vision and words and can produce pictures that represent the meaning of written descriptions is a big step toward augmenting human intellectual ability.

Image-processing techniques and applications of computer vision (CV) have grown immensely in recent years owing to advances made possible by artificial intelligence and the success of deep learning. One of these growing fields is text-to-image generation. The term text-to-image (T2I) generation refers to the creation of visually realistic pictures from text inputs. T2I generation is the reverse process of image captioning, also known as image-to-text (I2T) generation [2–4], which is the generation of a textual description from an input image. In T2I generation, the model takes an input in the form of a human-written description and produces an RGB image that matches the description. T2I generation has been an important field of study due to its tremendous capability in multiple areas. Photo-searching, photo-editing, art generation, captioning, portrait drawing, industrial design, and image manipulation are some common applications of creating photo-realistic images from text.

The evolution of generative adversarial networks (GANs) has demonstrated exceptional performance in image synthesis, image super-resolution, data augmentation, and image-to-image conversion. GANs are deep learning-based convolutional neural networks (CNNs) [5,6]. A GAN consists of two neural networks: one for generating data and the other for classifying real/fake data. GANs are based on game theory for learning generative models. The major purpose is to train a generator (G) to generate samples and a discriminator (D) to discern between true and false data. For generating better-quality realistic images, we performed text encoding using recurrent neural networks (RNNs), and convolutional layers were used for image decoding. We developed the recurrent convolutional GAN (RC-GAN), a simple and effective framework for appealing image synthesis from human-written textual descriptions. The model was trained on the Oxford-102 Flowers dataset and ensures the identity of the synthesized pictures. The key contributions of this research include the following:
• Building a deep learning model RC-GAN for generating more realistic images.
• Generating more realistic images from given textual descriptions.
• Improving the inception score and PSNR value of images generated from text.
The rest of the paper is arranged as follows. Related work is described in Section 2. The dataset and its preprocessing are discussed in Section 3. Section 4 explains the details of the research methodology. The experimental details and results are discussed in Section 5. Finally, the paper is concluded in Section 6.

2. Related Work
GANs were first introduced by Goodfellow [7] in 2014, but Reed et al. [8] was the
first to use them for text-to-image generation in 2016. Salimans et al. [9] proposed training
stabilizing techniques for previously untrainable models and achieved better results on
the MNIST, CIFAR-10, and SVHN datasets. The attention-based recurrent neural network
was developed by Zia et al. [10]. In their model, word-to-pixel dependencies were learned
by an attention-based auto-encoder and pixel-to-pixel dependencies were learned by an
autoregressive-based decoder. Liu et al. [11] offered a diverse conditional image synthesis
model and performed large-scale experiments for different conditional generation tasks.
Gao et al. [12] proposed an effective approach known as lightweight dynamic conditional
GAN (LD-CGAN), which disentangled the text attributes and provided image features by
capturing multi-scale features. Dong et al. [13] trained a model for generating images from
text in an unsupervised manner. Berrahal et al. [14] focused on the development of text-
to-image conversion applications. They used deep fusion GAN (DF-GAN) for generating
human face images from textual descriptions. The cross-domain feature fusion GAN (CF-
GAN) was proposed by Zhang et al. [15] for converting textual descriptions into images
with more semantic detail. In general, the existing methods of text-to-image generation use
wide-ranging parameters and heavy computations for generating high-resolution images,
which result in unstable and high-cost training.

3. Dataset and Preprocessing


3.1. Dataset
The dataset used was Oxford-102 flowers, which includes 8189 images of flowers from 102 different species. Each of the 102 classes consists of 40 to 60 images, and each image has 10 matching textual descriptions. In this study, we considered 8000 images for training, and the model was trained on this dataset for 300 epochs.

3.2. Data Preprocessing


When the data were collected and extracted initially, they consisted of 8189 images with different resolutions and their corresponding textual descriptions. For normalizing the textual data, we used the NLTK tokenizer, which split the textual sentences into words. These tokenized lists of words were transformed into arrays of caption ids. The images were loaded and resized to the same dimensions: all training and testing images were resized to a resolution of 128 × 128. For training purposes, the images were converted into arrays, and both the vocabulary and images were loaded into the model.
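
A minimal sketch of this preprocessing pipeline is given below. The NLTK tokenization, caption-id arrays, and 128 × 128 resizing come from the description above; the helper names, the use of PIL for image loading, and the normalization to [-1, 1] are assumptions for illustration only.

```python
# Illustrative preprocessing sketch; helper names and pixel normalization are assumptions.
import nltk
import numpy as np
from PIL import Image

nltk.download("punkt", quiet=True)  # tokenizer model used by word_tokenize

def build_vocab(captions):
    """Map every word that appears in the captions to an integer id (0 = padding)."""
    words = sorted({w.lower() for c in captions for w in nltk.word_tokenize(c)})
    return {w: i + 1 for i, w in enumerate(words)}

def encode_caption(caption, vocab, max_len=20):
    """Convert one caption sentence into a fixed-length array of caption ids."""
    ids = [vocab.get(w.lower(), 0) for w in nltk.word_tokenize(caption)][:max_len]
    return np.array(ids + [0] * (max_len - len(ids)), dtype=np.int64)

def load_image(path, size=128):
    """Load an image, resize it to 128x128, and scale pixel values to [-1, 1]."""
    img = Image.open(path).convert("RGB").resize((size, size))
    return np.asarray(img, dtype=np.float32) / 127.5 - 1.0
```
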

4. Proposed Methodology

This section describes the training details of the deep learning-based generative models. Conditional GANs were used with recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for generating meaningful images from a textual description. The dataset used consisted of images of flowers and their relevant textual descriptions. For generating plausible images from text using a GAN, preprocessing of the textual data and image resizing were performed. We took textual descriptions from the dataset, preprocessed these caption sentences, and created a list of their vocabulary. Then, these captions were stored with their respective ids in the list. The images were loaded and resized to a fixed dimension. These data were then given as input to our proposed model. An RNN was used for capturing the contextual information of text sequences by defining the relationship between words at different time steps. Text-to-image mapping was performed using an RNN and a CNN. The CNN recognized useful characteristics from the images without the need for human intervention. An input sequence was given to the RNN, which converted the textual descriptions into word embeddings with a size of 256. These word embeddings were concatenated with a 512-dimensional noise vector. To train our model, we took a batch size of 64 with gated-feedback 128 and fed the input noise and text input to a generator. The architecture of the proposed model is presented in Figure 1.

Figure 1. Architecture of the proposed method, which can generate images from text descriptions.
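
As an illustration of the text-encoding step described above, the following PyTorch sketch assumes a GRU-based recurrent encoder (the paper does not name the exact RNN cell or framework): tokenized caption ids are embedded, summarized into a 256-dimensional sentence vector, and concatenated with a 512-dimensional noise vector to form the generator input.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Recurrent text encoder: caption ids -> 256-dim sentence embedding."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # GRU cell is an assumption

    def forward(self, caption_ids):               # caption_ids: (batch, seq_len)
        _, h = self.rnn(self.embed(caption_ids))  # h: (1, batch, hidden_dim)
        return h.squeeze(0)                       # (batch, 256)

# Concatenate the sentence embedding with noise, as described in the text.
batch_size, noise_dim = 64, 512
encoder = TextEncoder(vocab_size=5000)                 # vocabulary size is illustrative
captions = torch.randint(1, 5000, (batch_size, 20))    # dummy caption-id sequences
text_code = encoder(captions)                          # (64, 256)
z = torch.randn(batch_size, noise_dim)                 # (64, 512)
generator_input = torch.cat([text_code, z], dim=1)     # (64, 768) fed to the generator
```
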

Semantic information from the textual description was used as input in the generator model, which converts characteristic information to pixels and generates the images. This generated image was used as input in the discriminator along with real/wrong textual descriptions and real sample images from the dataset.
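
The paper does not list the generator layer by layer; the sketch below is one plausible realization under that caveat, using a stack of transposed convolutions to decode the 768-dimensional text-plus-noise code into a 128 × 128 RGB image.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Decode a 768-dim text+noise code into a 128x128 RGB image (illustrative layout)."""
    def __init__(self, code_dim=768, base=64):
        super().__init__()
        self.base = base
        self.fc = nn.Linear(code_dim, base * 8 * 4 * 4)  # project to a 4x4 feature map

        def up(c_in, c_out):  # one 2x upsampling stage
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        self.net = nn.Sequential(
            up(base * 8, base * 8),                               # 4 -> 8
            up(base * 8, base * 4),                               # 8 -> 16
            up(base * 4, base * 2),                               # 16 -> 32
            up(base * 2, base),                                   # 32 -> 64
            nn.ConvTranspose2d(base, 3, 4, stride=2, padding=1),  # 64 -> 128
            nn.Tanh(),                                            # pixel values in [-1, 1]
        )

    def forward(self, code):
        x = self.fc(code).view(-1, self.base * 8, 4, 4)
        return self.net(x)  # (batch, 3, 128, 128)

# Example: a batch of 64 text embeddings (256-dim) concatenated with noise (512-dim).
fake_images = Generator()(torch.cat([torch.randn(64, 256), torch.randn(64, 512)], dim=1))
```
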
A sequence of distinct (picture and text) pairings is then provided as input to the model to meet the goals of the discriminator: input pairs of real images and real textual descriptions, wrong images and mismatched textual descriptions, and generated images and real textual descriptions. The real photo and real text combinations are provided so that the model can determine if a particular image and text combination align. An incorrect picture and real text description indicates that the image does not match the caption. The discriminator is trained to identify real and generated images. At the start of training, the discriminator was good at classifying real/wrong images. Loss was calculated to update the weights and to provide training feedback to the generator and discriminator models. As training proceeded, the generator produced more realistic images and fooled the discriminator when distinguishing between real and generated images.
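
A minimal sketch of this three-pair training signal is shown below. It follows the matching-aware formulation popularized by GAN-INT-CLS [8] rather than a loss spelled out in this paper, and it assumes a discriminator D(image, text_embedding) that outputs the probability of a real, matching pair.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real_img, wrong_img, fake_img, text_emb):
    """Discriminator loss over the three (image, text) pair types described above.

    D is assumed to map (image, text_embedding) to a probability in (0, 1); the 0.5
    weighting of the mismatch terms follows the GAN-CLS formulation of Reed et al. [8],
    not a value given in this paper.
    """
    ones = torch.ones(real_img.size(0), 1)
    zeros = torch.zeros(real_img.size(0), 1)
    loss_real = F.binary_cross_entropy(D(real_img, text_emb), ones)            # real image, real text
    loss_wrong = F.binary_cross_entropy(D(wrong_img, text_emb), zeros)         # mismatched image, real text
    loss_fake = F.binary_cross_entropy(D(fake_img.detach(), text_emb), zeros)  # generated image, real text
    return loss_real + 0.5 * (loss_wrong + loss_fake)

def generator_loss(D, fake_img, text_emb):
    """The generator is rewarded when D accepts its output as a real, matching pair."""
    ones = torch.ones(fake_img.size(0), 1)
    return F.binary_cross_entropy(D(fake_img, text_emb), ones)
```

The wrong-image term is what pushes the discriminator to judge text–image consistency rather than image realism alone.
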

5. Results and Discussion

In this section, the experimental analysis and generated images of flowers are presented. The training of the proposed model was performed on an Nvidia 1070 Ti GPU with 32 GB memory and a Windows 10 operating system. The weights of the generator and discriminator were optimized using an Adam optimizer, the mini-batch size was 64, and the learning rate was 0.0003. The ground truths from the dataset and the images generated from the input textual descriptions are shown in Figure 2. For evaluating the performance of the proposed model, the inception score (IS) and PSNR values were calculated. The inception score captures the diversity and quality of the generated images. PSNR is the peak signal-to-noise ratio in decibels between two images; this ratio compares the quality of the original and generated images, and the PSNR value increases as the quality of the synthesized image improves. The PSNR value is calculated using the following equation, where R is the maximum possible pixel value and MSE is the mean squared error between the two images:

PSNR = 10 log10 (R^2/MSE) (1)
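
As a reference for Equation (1), the following NumPy sketch computes the PSNR between a ground-truth and a generated image, with R taken as the maximum possible pixel value (255 for 8-bit images) and MSE as the mean squared error.

```python
import numpy as np

def psnr(original, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images, as in Equation (1)."""
    original = original.astype(np.float64)
    generated = generated.astype(np.float64)
    mse = np.mean((original - generated) ** 2)  # mean squared error between the images
    if mse == 0:                                # identical images: PSNR is unbounded
        return float("inf")
    return 10.0 * np.log10((max_value ** 2) / mse)
```
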

Figure 2. Input textual descriptions and resultant generated images with ground truths.

To validate the proposed approach, the results are compared with those of existing models, including GAN-INT-CLS, StackGAN, StackGAN++, HDGAN, and DualAttn-GAN, on the Oxford-102 flowers dataset. The performance comparison in terms of inception score is shown in Table 1 and that of the PSNR value is shown in Table 2. These results show that our model is capable of generating more unambiguous and diverse photos than the other models.

Table 1. Performance comparison of state-of-the-art methods vs. the methodology presented here by inception score.

Ref.                  Models          Inception Score
Reed et al. [8]       GAN-INT-CLS     2.66 ± 0.03
Zhang et al. [16]     StackGAN        3.20 ± 0.01
Zhang et al. [17]     StackGAN++      3.26 ± 0.01
Zhang et al. [18]     HDGAN           3.45 ± 0.05
Cai et al. [19]       DualAttn-GAN    4.06 ± 0.05
Proposed Method       RC-GAN          4.15 ± 0.03

Table 2. PSNR value of generated images.

Ref.                  Model           PSNR Value
Proposed Method       RC-GAN          30.12 dB

6. Conclusions and Future Work


In the fields of computer vision and natural language processing, text-to-image gener-
ation is a hot topic these days. For producing visually realistic and semantically consistent
images, we presented a deep learning-based model (RC-GAN) and described how it works
in the confluence of computer vision and natural language processing. The model was
trained by text encoding and image decoding. Extensive experiments on the Oxford-102
flowers dataset demonstrated that the proposed GAN model generates better-quality im-
ages compared with existing models by providing the best recorded IS. The performance of
our proposed method was compared with that of state-of-the-art methods using IS. Our
model achieved an IS of 4.15 and a PSNR value of 30.12 dB on the Oxford-102 flowers
dataset. We want to expand this work by training the model on a variety of datasets for
image generation.

Author Contributions: The authors confirm contribution to the paper as follows: study conception,
design and data collection: S.R.; analysis and interpretation of results: S.R., M.M.I., T.K.; draft
manuscript preparation: S.R., M.M.I. All authors have read and agreed to the published version of
the manuscript.
Funding: The authors did not receive funding for this research.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The dataset is publicly available.
Conflicts of Interest: The authors declare that they have no conflict of interest.

References
1. Kosslyn, S.M.; Ganis, G.; Thompson, W.L. Neural foundations of imagery. Nat. Rev. Neurosci. 2001, 2, 635–642. [CrossRef]
[PubMed]
2. Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137.
3. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
4. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption
generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July
2015; pp. 2048–2057.
5. Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the 2017
International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017; pp. 1–6.
6. Kim, P. Convolutional neural network. In MATLAB Deep Learning; Springer: Berlin/Heidelberg, Germany, 2017; pp. 121–147.
7. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial
networks. arXiv 2014, arXiv:1406.2661. [CrossRef]
8. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. arXiv 2016,
arXiv:1605.05396.
9. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. In
Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016.
10. Zia, T.; Arif, S.; Murtaza, S.; Ullah, M.A. Text-to-Image Generation with Attention Based Recurrent Neural Networks. arXiv 2020,
arXiv:2001.06658.
11. Liu, R.; Ge, Y.; Choi, C.L.; Wang, X.; Li, H. DivCo: Diverse conditional image synthesis via contrastive generative adversarial
network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25
June 2021; pp. 16377–16386.
12. Gao, L.; Chen, D.; Zhao, Z.; Shao, J.; Shen, H.T. Lightweight dynamic conditional GAN with pyramid attention for text-to-image
synthesis. Pattern Recognit. 2021, 110, 107384. [CrossRef]
13. Dong, Y.; Zhang, Y.; Ma, L.; Wang, Z.; Luo, J. Unsupervised text-to-image synthesis. Pattern Recognit. 2021, 110, 107573. [CrossRef]
14. Berrahal, M.; Azizi, M. Optimal text-to-image synthesis model for generating portrait images using generative adversarial
network techniques. Indones. J. Electr. Eng. Comput. Sci. 2022, 25, 972–979. [CrossRef]
15. Zhang, Y.; Han, S.; Zhang, Z.; Wang, J.; Bi, H. CF-GAN: Cross-domain feature fusion generative adversarial network for
text-to-image synthesis. Vis. Comput. 2022, 1–11. [CrossRef]

16. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D.N. StackGAN: Text to photo-realistic image synthesis with
stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy,
22–29 October 2017; pp. 5907–5915.
17. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D.N. StackGAN++: Realistic image synthesis with stacked
generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1947–1962. [CrossRef] [PubMed]
18. Zhang, Z.; Xie, Y.; Yang, L. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6199–6208.
19. Cai, Y.; Wang, X.; Yu, Z.; Li, F.; Xu, P.; Li, Y.; Li, L. DualAttn-GAN: Text to image synthesis with dual attentional generative
adversarial network. IEEE Access 2019, 7, 183706–183716. [CrossRef]
