Abstract
A visual conversation is a dialog in which the parties exchange visual information. The key novelty of this paper is a method for automating visual conversation with artificial intelligence. We present a state-of-the-art Artificial Intelligence Snapchat Visual Conversation Agent (AISVCA), which uses the proposed method to caption a received image and to generate an appropriate visual response. These functions are achieved by combining a Convolutional Neural Network (CNN), a Long Short-Term Memory neural network (LSTM), and Latent Semantic Indexing (LSI). The CNN and LSTM generate image captions, and LSI assesses the semantic similarity between captions generated from a personalized image dataset and captions extracted from the received image content. We show that AISVCA, using the proposed method, can generate a visual response that is practically indistinguishable from a human visual response. To evaluate the approach, we measured the accuracy of the system and conducted a user study of communication quality, in which we analyzed the source credibility and interpersonal attraction of AISVCA. The user study showed no significant differences in communication quality between a visual conversation with AISVCA and a visual conversation with a human agent.
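The LSI matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the captions are hard-coded stand-ins for CNN+LSTM output, the raw term-count weighting and the rank `k` are assumptions, and the function names (`build_lsi`, `fold_in`, `best_match`) are hypothetical. The sketch projects the captions of a personalized image dataset into a latent space with a truncated SVD, folds the caption of the received image into the same space, and picks the dataset image whose caption has the highest cosine similarity.

```python
import numpy as np


def tokenize(text):
    # Assumption: simple whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def build_lsi(captions, k=2):
    """Build a rank-k LSI space from a list of captions (one per image)."""
    vocab = sorted({w for c in captions for w in tokenize(c)})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-document matrix of raw counts: rows are terms, columns are captions.
    A = np.zeros((len(vocab), len(captions)))
    for j, c in enumerate(captions):
        for w in tokenize(c):
            A[index[w], j] += 1.0
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    U_k = U[:, :k]
    # Each caption projected onto the top-k left singular vectors.
    doc_vecs = A.T @ U_k
    return index, U_k, doc_vecs


def fold_in(caption, index, U_k):
    """Project a new caption into the existing latent space."""
    q = np.zeros(U_k.shape[0])
    for w in tokenize(caption):
        if w in index:  # words outside the vocabulary are ignored
            q[index[w]] += 1.0
    return q @ U_k


def best_match(received_caption, captions, k=2):
    """Return the index of the most similar caption and all cosine scores."""
    index, U_k, doc_vecs = build_lsi(captions, k)
    q = fold_in(received_caption, index, U_k)
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q)
    sims = np.divide(doc_vecs @ q, norms,
                     out=np.zeros(len(captions)), where=norms > 0)
    return int(np.argmax(sims)), sims


# Hypothetical captions for a personalized image dataset.
personal_captions = [
    "a dog playing in the park",
    "a plate of pasta on a table",
    "a cat sleeping on a sofa",
]
choice, scores = best_match("a dog running in the park", personal_captions)
print("selected image index:", choice)
print("cosine similarities:", scores)
```

In a full pipeline, the selected index would identify the image to send back as the visual response. A smaller `k` merges related terms more aggressively, so its value would need tuning on real caption data.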





Cite this article
Arsovski, S., Cheok, A.D., Govindarajoo, K. et al. Artificial intelligence snapchat: Visual conversation agent. Appl Intell 50, 2040–2049 (2020). https://doi.org/10.1007/s10489-019-01621-2
DOI: https://doi.org/10.1007/s10489-019-01621-2