Text To Image Generation Using XLNet-Paper Draft
The objectives of this work are:

• To implement a text-to-image generation model using XLNet as a text encoder and a GAN-based architecture for image synthesis.
• To evaluate the model's performance on a dataset of flower images and corresponding descriptions.
• To compare the effectiveness of XLNet-based encoding against other text encoding techniques.
II. RELATED WORK

Previous efforts in text-to-image synthesis have utilized various deep learning techniques, notably GANs coupled with recurrent neural networks (RNNs) [1] or simple transformer models like BERT. While GANs such as StackGAN and AttnGAN [2] showed promising results, their reliance on older text models limited their ability to effectively capture long-range dependencies in text.

Figure 1: The model consists of three main components.

XLNet, which excels in capturing bidirectional context and has achieved state-of-the-art performance on various NLP tasks [5], encodes the textual descriptions. The generator network takes the encoded text vectors as input and generates images that match the textual descriptions. The discriminator evaluates the generated images, distinguishing between real and synthetic images.
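The encoding step can be illustrated with a minimal sketch. It assumes the Hugging Face transformers library and the pretrained xlnet-base-cased checkpoint; the mean-pooling step and the encode_description helper are our own illustrative choices, not a prescribed interface.

```python
import torch
from transformers import XLNetModel, XLNetTokenizer

# Load a pretrained XLNet (assumed checkpoint: xlnet-base-cased).
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
encoder = XLNetModel.from_pretrained("xlnet-base-cased")
encoder.eval()

def encode_description(text: str) -> torch.Tensor:
    """Return a fixed-size embedding for one textual description."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool the token-level hidden states into a single sentence
    # vector; other pooling schemes (e.g. the final <cls> state) work too.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

embedding = encode_description("a flower with thin white petals and a yellow center")
print(embedding.shape)  # torch.Size([768]) for the base model
```

The resulting vectors are what the generator consumes as conditioning input.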
The generator and discriminator are trained adversarially, with the generator improving its ability to create realistic images and the discriminator enhancing its ability to identify fakes. Training our model involves alternating between optimizing the generator and the discriminator.
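A minimal PyTorch sketch of this alternating loop is shown below. The tiny MLP generator and discriminator, the layer sizes, the optimizer settings, and the dummy data are stand-ins we introduce for illustration; they are not the architecture or hyperparameters of our model.

```python
import torch
import torch.nn as nn

TEXT_DIM, NOISE_DIM, IMG_DIM = 768, 100, 64 * 64 * 3  # illustrative sizes

class Generator(nn.Module):
    """Maps (noise, text embedding) to a flattened image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + TEXT_DIM, 512), nn.ReLU(),
            nn.Linear(512, IMG_DIM), nn.Tanh())

    def forward(self, noise, text_emb):
        return self.net(torch.cat([noise, text_emb], dim=1))

class Discriminator(nn.Module):
    """Scores an (image, text embedding) pair with a real/fake logit."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + TEXT_DIM, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1))

    def forward(self, images, text_emb):
        return self.net(torch.cat([images, text_emb], dim=1))

generator, discriminator = Generator(), Discriminator()
criterion = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Dummy batches standing in for (flower image, XLNet embedding) pairs.
dataloader = [(torch.randn(8, IMG_DIM), torch.randn(8, TEXT_DIM)) for _ in range(4)]

for real_images, text_emb in dataloader:
    batch = real_images.size(0)
    noise = torch.randn(batch, NOISE_DIM)

    # Discriminator step: push real pairs toward 1, generated pairs toward 0.
    d_opt.zero_grad()
    fake_images = generator(noise, text_emb).detach()
    d_loss = (criterion(discriminator(real_images, text_emb), torch.ones(batch, 1)) +
              criterion(discriminator(fake_images, text_emb), torch.zeros(batch, 1)))
    d_loss.backward()
    d_opt.step()

    # Generator step: make the discriminator score freshly generated fakes as real.
    g_opt.zero_grad()
    g_loss = criterion(discriminator(generator(noise, text_emb), text_emb),
                       torch.ones(batch, 1))
    g_loss.backward()
    g_opt.step()
```

Conditioning the discriminator on the text embedding, as here, lets it penalize images that are realistic but mismatched to their descriptions.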
REFERENCES

[1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele, "Evaluation of Output Embeddings for Fine-Grained Image Classification," in Proc. CVPR, 2015.

[2] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra, "DRAW: A Recurrent Neural Network for Image Generation," in Proc. ICML, 2015.

[3] J. Yang, S. Reed, M.-H. Yang, and H. Lee, "Weakly-Supervised Disentangling with Recurrent Transformations for 3D View Synthesis," in Proc. NIPS, 2015.

[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, ... and Y. Bengio, "Generative Adversarial Nets," Advances in Neural Information Processing Systems, vol. 27, pp. 2672-2680, 2014.

[5] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized Autoregressive Pretraining for Language Understanding," Advances in Neural Information Processing Systems, vol. 32, pp. 5754-5764, 2019.

[7] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks," in Proc. ICCV, pp. 2223-2232, 2017.

[8] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative Adversarial Text to Image Synthesis," arXiv preprint arXiv:1605.05396, 2016.

[9] K. Sohn, W. Shang, and H. Lee, "Improved Multimodal Deep Learning with Variation of Information," in Proc. NIPS, 2014.

[10] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: A Neural Image Caption Generator," in Proc. CVPR, 2015.