Abstract – Image captioning is the process of automatically describing an image with one or more natural language sentences. In recent years, image captioning has witnessed rapid progress, from the initial template-based models to the current ones based on deep neural networks. This paper gives an overview of the issues and recent research in image captioning, with a particular emphasis on models that use the deep encoder-decoder architecture. We discuss the advantages and disadvantages of different approaches and review some of the most commonly used evaluation metrics and datasets.

Keywords – image captioning, encoder-decoder, attention mechanism, deep neural networks

I. INTRODUCTION

Recent advances in deep learning methods on perceptual tasks, such as image classification and object detection [1, 2], have encouraged researchers to tackle even more difficult problems in which recognition is just a step towards more complex reasoning about our visual world [3]. Image captioning is one such task.

The aim of image captioning is to automatically describe an image with one or more natural language sentences. It is a problem that integrates computer vision and natural language processing, so its main challenges arise from the need to translate between two distinct, but usually paired, modalities [4]. First, it is necessary to detect the objects in the scene and determine the relationships between them [5], and then to express the image content correctly with properly formed sentences. Generated descriptions are still quite different from the way people describe images, because people rely on common sense and experience, point out important details, and ignore objects and relationships that are implied [6]. Moreover, people often use imagination to make descriptions vivid and interesting.

Regardless of the existing limitations, image captioning has already proven to have useful applications, such as helping visually impaired people perform daily tasks. Automatically generated descriptions can also be used for content-based retrieval [7] or in social media communications.

Early image captioning approaches relied on predefined templates, which were filled in based on the results of detecting elements in the scene [8, 9]. However, the advantage of such bottom-up approaches in terms of their ability to capture details was not enough to keep them in the focus of research interest. The generated sentences were too simple, lacking the fluency of human writing. Moreover, such systems were heavily hand-designed, which constrained their flexibility. Some authors [10] have reformulated image captioning as a ranking task. Ranking-based approaches always return well-formed sentences, but they cannot generate new sentences or describe compositionally new images [11], i.e., those containing objects that were observed during training but appear in different combinations in the test image. In contrast, today’s state-of-the-art models are generative and based on neural networks. They usually employ an encoder-decoder architecture combining a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN).

The rest of the paper is organized as follows: the next section provides background information on the typical architecture of image captioning systems. Section III groups image captioning models according to the captioning task and describes relevant models of each type. Section IV presents some of the most commonly used datasets, along with a description of how they were collected. Section V lists the evaluation metrics and points to the problems that arise when evaluating generative approaches. The paper ends with a conclusion.

II. ARCHITECTURE AND LEARNING APPROACHES

A. Encoder-Decoder Framework

Inspired by its success in Neural Machine Translation [12], many of the current state-of-the-art models for image captioning employ the encoder-decoder architecture (Fig. 1). In this architecture, the encoder maps the input into a real-valued, fixed-dimensional vector representation. A decoder then generates the output, conditioned on the representation produced by the encoder. The main advantage of such a system is that it can be trained end-to-end, meaning that the parameters of the whole network are learned together, thereby avoiding the problem of aligning several independent components.

Image captioning is often understood as the task of translating one modality, i.e. an image, into another modality, i.e. its description, so the encoder-decoder architecture has been successfully applied with a convolutional neural network (CNN) [13] on the encoder side and a recurrent neural network (RNN) [14] on the decoder side.

The CNN acts as a feature extractor that is usually pretrained on a large dataset for a classification task [15]. A feature map from a convolutional layer or the vector representation from a fully-connected layer is then used as the image representation. An RNN, or one of its variants such as the long short-term memory (LSTM) network [16], is employed for language modeling.
Figure 1. Encoder-decoder framework for image captioning: first, a CNN encoder produces the image representation (left); then, an LSTM decoder generates a caption conditioned on the representation produced by the encoder (right).
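For illustration, the following minimal sketch shows how such an encoder-decoder captioner could be assembled, assuming PyTorch with a pretrained ResNet-50 backbone; the class and variable names (EncoderCNN, DecoderRNN, embed_size, ...) are ours and do not correspond to any particular published model.

```python
# Minimal sketch of a CNN-LSTM encoder-decoder captioner (PyTorch, torchvision >= 0.13 assumed).
# All names are illustrative; real systems differ in many details.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Pretrained CNN used as a feature extractor (encoder)."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        for p in self.backbone.parameters():
            p.requires_grad = False                       # keep the pretrained weights frozen
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                            # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)          # (B, 2048) pooled feature vector
        return self.fc(feats)                             # (B, embed_size)

class DecoderRNN(nn.Module):
    """LSTM language model conditioned on the image representation."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_feats, captions):             # captions: (B, T) word indices
        # Prepend the image representation as the first input step (teacher forcing).
        inputs = torch.cat([image_feats.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
        hidden, _ = self.lstm(inputs)                      # (B, T, hidden_size)
        return self.out(hidden)                            # per-step vocabulary scores
```

During training, the per-step vocabulary scores are typically compared with the ground-truth caption using a cross-entropy loss; at inference time, the caption is generated word by word, e.g. greedily or with beam search.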
To be useful, metrics should match the ratings of human evaluators, but this has turned out to be a goal that is difficult to achieve. Evaluation metrics should satisfy two criteria [22]: (1) captions that are considered good by humans should achieve high scores, and (2) captions that achieve high scores should be considered good by humans.

Image captioning is sometimes compared [47] to language translation [8, 17] or to text summarization [48], which motivated the adaptation of metrics initially developed for the evaluation of language tasks [49, 50, 51]. All these metrics output a score indicating the similarity between the candidate sentence and the reference sentences.

BLEU [49] is a popular metric for machine translation evaluation and one of the first metrics used to evaluate image descriptions. It computes the geometric mean of n-gram precision scores, multiplied by a brevity penalty to avoid overly short sentences.
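As a rough illustration of this computation (a simplified sketch, not the official BLEU script from [49]), the following function computes clipped n-gram precisions, their geometric mean and a brevity penalty for a single candidate sentence; all names are ours.

```python
# Simplified single-sentence BLEU sketch (illustration only).
from collections import Counter
from math import exp, log

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """candidate: list of tokens; references: list of token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        # Clip each candidate n-gram count by its maximum count over the references.
        max_ref = Counter()
        for ref in references:
            for gram, cnt in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
        total = sum(cand.values()) or 1
        log_precisions.append(log(max(clipped, 1e-9) / total))   # smoothed to avoid log(0)
    geo_mean = exp(sum(log_precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the closest reference.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else exp(1 - ref_len / max(len(candidate), 1))
    return bp * geo_mean
```

For example, bleu("a dog runs on the grass".split(), ["a dog is running on the grass".split()]) returns a score between 0 and 1; corpus-level BLEU aggregates the clipped counts over all sentences before combining them.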
METEOR [50] is another machine translation metric. It relies on stemmers, WordNet [52] synonyms and paraphrase tables to identify matches between the candidate sentence and the reference sentences.

ROUGE [51] is a package of measures initially developed for the evaluation of text summaries. For image captioning, the ROUGE-L variant is usually used, which computes an F-measure based on the Longest Common Subsequence (LCS), i.e. a set of words shared by two sentences that occur in the same order, without requiring consecutive matches.
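A minimal sketch of this computation is shown below, assuming tokenized sentences and a single reference (the full package supports multiple references and additional options); the function names are ours.

```python
# Simplified ROUGE-L sketch: LCS-based F-measure between a candidate and one reference.
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    # Weighted harmonic mean; implementations commonly weight recall more heavily (beta > 1).
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```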
CIDEr [44] is a metric designed for the evaluation of automatically generated image captions. It measures the similarity between the candidate sentence and a set of human-written sentences by applying Term Frequency–Inverse Document Frequency (TF-IDF) weighting to each n-gram.
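The following simplified sketch conveys the idea of TF-IDF weighted n-gram similarity; it omits the length penalty and scaling used in the official CIDEr-D implementation and assumes that document frequencies have been pre-computed over the whole set of reference captions (doc_freq maps each n-gram to the number of images whose references contain it). All names are ours.

```python
# Simplified CIDEr-style sketch: cosine similarity of TF-IDF weighted n-gram vectors.
from collections import Counter
from math import log, sqrt

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tfidf(counts, doc_freq, num_images):
    total = sum(counts.values()) or 1
    # Frequent n-grams (high document frequency) get down-weighted.
    return {g: (c / total) * log(num_images / (1 + doc_freq.get(g, 0))) for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cider(candidate, references, doc_freq, num_images, max_n=4):
    """Average, over n = 1..max_n, of the mean cosine similarity to the references."""
    score = 0.0
    for n in range(1, max_n + 1):
        cand_vec = tfidf(ngrams(candidate, n), doc_freq, num_images)
        sims = [cosine(cand_vec, tfidf(ngrams(r, n), doc_freq, num_images)) for r in references]
        score += sum(sims) / len(sims)
    return score / max_n
```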
SPICE [45] is a metric designed for image caption evaluation. It measures the quality of generated captions by computing an F-measure over the propositional semantic content of the candidate and reference sentences, represented as scene graphs [53].
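Because the scene-graph parsing is the substantial part of SPICE, only the final scoring step is sketched here, assuming the propositional tuples have already been extracted; this is an illustration, not the SPICE implementation.

```python
# SPICE-style final step: F-measure over semantic tuples (objects, attributes, relations),
# e.g. ('dog',), ('dog', 'brown'), ('dog', 'on', 'grass'). Tuple extraction is assumed done.
def spice_f1(candidate_tuples: set, reference_tuples: set) -> float:
    matched = len(candidate_tuples & reference_tuples)   # real SPICE also matches WordNet synonyms
    if matched == 0:
        return 0.0
    precision = matched / len(candidate_tuples)
    recall = matched / len(reference_tuples)
    return 2 * precision * recall / (precision + recall)
```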
The metrics above represent a standard set of metrics usually reported in papers. Their popularity can be attributed to their availability through the Microsoft COCO caption evaluation server [41], which enables a consistent comparison of different models.

However, it has been shown [47, 10] that automatic metrics do not always correlate with human judgments. This was particularly evident during the Microsoft COCO 2015 Captioning Challenge, when some models outperformed the human upper bound according to automatic metrics, yet human judges demonstrated a preference for the human-written captions [54]. It seems that “humans do not always like what is human-like” [44]. Since there is no single best metric, some authors [45, 46] advise the use of an ensemble of metrics capturing various dimensions, such as grammaticality, saliency, correctness or truthfulness. In [22, 46], new evaluation metrics were proposed.

VI. CONCLUSION

This paper presents an overview of recent advances in image captioning research, with a particular focus on models employing deep encoder-decoder architectures. The main advantage of such architectures is that they are trainable end-to-end, mapping directly from images to sentences.

An important extension of the basic encoder-decoder framework is the attention mechanism, which enables the model to focus on the most salient parts of the input image while generating the next word of the output. We group the related work into three types according to the task: standard image captioning (with or without an attention mechanism), image captioning with style, and cross-lingual or multilingual image captioning.

Large vision and language datasets have also contributed significantly to the development of the field. Additional features of the new datasets, such as emotions or descriptions in different languages, will certainly stimulate even faster advances in the periods to come.

However, some important tasks are still unresolved, such as generating captions more in the spirit of human descriptions, automatic adaptation of descriptions to the given task, and, perhaps most challenging among them, automatic assessment of the generated captions, since there are still no metrics that fully match human evaluation.

ACKNOWLEDGMENT

This research was supported by the Croatian Science Foundation under the project IP-2016-06-8345 “Automatic recognition of actions and activities in multimedia content from the sports domain” (RAASS) and by the University of Rijeka under the project number 18-222-1385.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS’12, 2012, vol. 1, pp. 1097–1105.
[2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” arXiv preprint arXiv:1506.01497, 2015.
[3] R. Krishna et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.
[4] R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in ICML-14, 2014, pp. 595–603.
[5] M. Ivašić-Kos, I. Ipšić, and S. Ribarić, “A knowledge-based multi-layered image annotation system,” Expert Systems with Applications, vol. 42, pp. 9539–9553, 2015.
[6] M. Ivašić-Kos, M. Pavlić, and M. Pobar, “Analyzing the semantic level of outdoor image annotation,” in MIPRO 2009, IEEE, Opatija, pp. 293–296.
[7] M. Pobar and M. Ivašić-Kos, “Multimodal Image Retrieval Based on Keywords and Low-Level Image Features,” in Semantic Keyword-based Search on Structured Data, IKC 2015, Coimbra, Portugal, Springer, 2015, pp. 133–140.
[8] G. Kulkarni et al., “Babytalk: Understanding and generating simple image descriptions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
[9] M. Ivašić-Kos, M. Pobar, and S. Ribarić, “Two-tier image annotation model based on a multi-label classifier and fuzzy-knowledge representation scheme,” Pattern Recognition, vol. 52, pp. 287–305, 2016.
[10] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” JAIR, vol. 47, pp. 853–899, 2013.
[11] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-RNN),” arXiv preprint arXiv:1412.6632, 2014.
[12] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NIPS’14, 2014, vol. 2, pp. 3104–3112.
[13] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, 1995.
[14] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[15] O. Russakovsky et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[17] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, 2015, pp. 3156–3164.
[18] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[19] J. Donahue et al., “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015, pp. 2625–2634.
[20] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in CVPR, 2017, pp. 7008–7024.
[21] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” arXiv preprint arXiv:1511.06732, 2015.
[22] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image Captioning via Policy Gradient Optimization of SPIDEr,” arXiv preprint arXiv:1612.00370, 2016.
[23] B. Dai, S. Fidler, R. Urtasun, and D. Lin, “Towards diverse and natural image descriptions via a conditional GAN,” in ICCV, 2017, pp. 2970–2979.
[24] B. Dai and D. Lin, “Contrastive learning for image captioning,” in Advances in Neural Information Processing Systems, 2017, pp. 898–907.
[25] K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015, pp. 2048–2057.
[26] H. Fang et al., “From captions to visual concepts and back,” in CVPR, 2015, pp. 1473–1482.
[27] M. Ivašić-Kos, M. Pobar, and S. Ribarić, “Automatic image annotation refinement using fuzzy inference algorithms,” in IFSA-EUSFLAT 2015, Gijón, Asturias, Spain, 2015, p. 242.
[28] W. Lan, X. Li, and J. Dong, “Fluency-guided cross-lingual image captioning,” in Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 1549–1557.
[29] A. P. Mathews, L. Xie, and X. He, “SentiCap: Generating image descriptions with sentiments,” in 13th AAAI Conference on Artificial Intelligence, 2016.
[30] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in CVPR, 2016, pp. 4651–4659.
[31] P. Anderson et al., “Bottom-up and top-down attention for image captioning and VQA,” arXiv preprint arXiv:1707.07998, 2017.
[32] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” arXiv preprint arXiv:1612.01887, 2016.
[33] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, “StyleNet: Generating attractive visual captions with styles,” in CVPR, 2017, pp. 3137–3146.
[34] Q. You, H. Jin, and J. Luo, “Image captioning at will: A versatile scheme for effectively injecting sentiments into image descriptions,” arXiv preprint arXiv:1801.10121, 2018.
[35] T. Chen et al., “Factual or Emotional: Stylized Image Captioning with Adaptive Learning and Attention,” in ECCV, 2018, pp. 519–535.
[36] O. M. Nezami, M. Dras, P. Anderson, and L. Hamey, “Face-Cap: Image Captioning Using Facial Expression Analysis,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2018, pp. 226–240.
[37] T. Miyazaki and N. Shimizu, “Cross-lingual image caption generation,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, vol. 1, pp. 1780–1790.
[38] D. Elliott, S. Frank, and E. Hasler, “Multilingual image description with neural sequence models,” arXiv preprint arXiv:1510.04709, 2015.
[39] S. Tsutsui and D. Crandall, “Using artificial tokens to control languages for multilingual image caption generation,” arXiv preprint arXiv:1706.06275, 2017.
[40] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using Amazon’s Mechanical Turk,” in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, 2010, pp. 139–147.
[41] X. Chen et al., “Microsoft COCO captions: Data collection and evaluation server,” arXiv preprint arXiv:1504.00325, 2015.
[42] R. Bernardi et al., “Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures,” JAIR, vol. 55, pp. 409–442, 2016.
[43] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
[44] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in CVPR, 2015, pp. 4566–4575.
[45] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic propositional image caption evaluation,” in ECCV 2016, 2016, pp. 382–398.
[46] M. Kilickaya, A. Erdem, N. Ikizler-Cinbis, and E. Erdem, “Re-evaluating automatic metrics for image captioning,” arXiv preprint arXiv:1612.07600, 2016.
[47] D. Elliott and F. Keller, “Comparing automatic evaluation measures for image description,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Short Papers, 2014, pp. 452–457.
[48] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos, “Corpus-guided sentence generation of natural images,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011, pp. 444–454.
[49] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[50] M. Denkowski and A. Lavie, “Meteor universal: Language-specific translation evaluation for any target language,” in 9th Workshop on Statistical Machine Translation, 2014, pp. 376–380.
[51] C. Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, 2004, vol. 8.
[52] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, “Introduction to WordNet: An on-line lexical database,” International Journal of Lexicography, vol. 3, no. 4, pp. 235–244, 1990.
[53] J. Johnson et al., “Image retrieval using scene graphs,” in CVPR, 2015, pp. 3668–3678.
[54] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Lessons learned from the 2015 MS COCO image captioning challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 652–663, 2017.