
Ref 12

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/334417783

Deep Image Captioning: An Overview

Conference Paper · May 2019


DOI: 10.23919/MIPRO.2019.8756821

CITATIONS READS

26 654

2 authors:

Ingrid Hrga Marina Ivašić-Kos

1 PUBLICATION 26 CITATIONS
University of Rijeka
77 PUBLICATIONS 1,306 CITATIONS
SEE PROFILE
SEE PROFILE

All content following this page was uploaded by Marina Ivašić-Kos on 19 February 2020.

The user has requested enhancement of the downloaded file.


20–24 May 2019
Ref 12

Deep Image Captioning: An Overview


I. Hrga*, M. Ivašić-Kos**
*Juraj Dobrila University of Pula/Faculty of Informatics, Pula, Croatia
**University of Rijeka/Department of Informatics, Rijeka, Croatia
ingrid.hrga@unipu.hr, marinai@uniri.hr

Abstract – Image captioning is the process of automatically describing an image with one or more natural language sentences. In recent years, image captioning has witnessed rapid progress, from initial template-based models to the current ones based on deep neural networks. This paper gives an overview of issues and recent image captioning research, with a particular emphasis on models that use the deep encoder-decoder architecture. We discuss the advantages and disadvantages of different approaches, along with reviewing some of the most commonly used evaluation metrics and datasets.

Keywords – image captioning, encoder-decoder, attention mechanism, deep neural networks

I. INTRODUCTION

Recent advances in deep learning methods on perceptual tasks, such as image classification and object detection [1, 2], have encouraged researchers to tackle even more difficult problems for which recognition is just a step towards more complex reasoning about our visual world [3]. Image captioning is one such task.

The aim of image captioning is to automatically describe an image with one or more natural language sentences. This is a problem that integrates computer vision and natural language processing, so its main challenges arise from the need to translate between two distinct, but usually paired, modalities [4]. First, it is necessary to detect the objects in the scene and determine the relationships between them [5], and then to express the image content correctly with properly formed sentences. The generated descriptions are still much different from the way people describe images, because people rely on common sense and experience, point out important details and ignore objects and relationships that are implied [6]. Moreover, they often use imagination to make descriptions vivid and interesting.

Regardless of the existing limitations, image captioning has already been proven to have useful applications, such as helping visually impaired people in performing daily tasks. Automatically generated descriptions can also be used for content-based retrieval [7] or in social media communications.

Early image captioning approaches relied on the use of predefined templates, which were filled in based on the results of the detection of elements in the scene [8, 9]. However, the advantage of such bottom-up approaches in terms of the ability to capture details was not enough to keep them in the focus of research interest. Generated sentences were too simple, lacking the fluency of human writing. Moreover, such systems were heavily hand-designed, which constrained their flexibility. Some authors [10] have reformulated image captioning as a ranking task. Ranking-based approaches always return well-formed sentences, but they cannot generate new sentences or describe compositionally new images [11], i.e., those containing objects that were observed during training but appear in different combinations in the test image. In contrast, today's state-of-the-art models are generative and based on neural networks. They usually employ an encoder-decoder architecture, combining a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN).

The rest of the paper is organized as follows: the next Section provides some background information on the typical architecture of image captioning systems. Section III groups image captioning models according to the captioning task and describes relevant models for each type. Section IV presents some of the most commonly used datasets, along with a description of how they were collected. Section V lists the metrics and points to the problems that arise when evaluating generative approaches. The paper ends with a Conclusion.

II. ARCHITECTURE AND LEARNING APPROACHES

A. Encoder-Decoder Framework

Inspired by its success in Neural Machine Translation [12], many of the current state-of-the-art models for image captioning employ the encoder-decoder architecture (Fig. 1). In this architecture, the encoder is used to map the input into a real-valued, fixed-dimensional vector representation. A decoder then generates the output, conditioned on the representation produced by the encoder. The main advantage of such a system is that it can be trained end-to-end, meaning that the parameters of the whole network are learned together, thereby avoiding the problem of aligning several independent components.

Image captioning is often understood as a task of translating one modality, i.e. an image, into another modality, i.e. its description, so the encoder-decoder architecture has been successfully applied with a convolutional neural network (CNN) [13] on the encoder side and a recurrent neural network (RNN) [14] on the decoder side.

A CNN acts as a feature extractor that is usually pretrained on a large dataset for a classification task [15]. A feature map from a convolutional layer or the vector representation from a fully-connected layer is then used as the image representation. An RNN or one of its variants, such as the long short-term memory (LSTM) network [16], is employed for language modeling.

Figure 1. Encoder-decoder framework for image captioning: first a CNN encoder produces the image representation (left), then an LSTM decoder generates the caption conditioned on the representation produced by the encoder (right)
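To make the framework concrete, the following is a minimal sketch of such a CNN–LSTM captioner in PyTorch. It is an illustration under assumed names and dimensions (CaptionDecoder, embed_dim, hidden_dim, feat_dim), not the implementation of any model discussed in this paper.

```python
# Minimal CNN encoder + LSTM decoder sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.init_proj = nn.Linear(feat_dim, embed_dim)   # image feature -> first LSTM input
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)       # hidden state -> word scores

    def forward(self, image_features, captions):
        # The projected image feature is fed as the first input, followed by
        # the embedded caption tokens (teacher forcing during training).
        img = self.init_proj(image_features).unsqueeze(1)  # (B, 1, E)
        words = self.embed(captions)                       # (B, N, E)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.fc(hidden)                             # (B, N+1, vocab_size)

# Encoder: a CNN used as a feature extractor (in practice initialized with
# weights pre-trained for ImageNet classification [15]).
cnn = models.resnet50()
encoder = nn.Sequential(*list(cnn.children())[:-1])       # drop the classification layer
# image_features = encoder(images).flatten(1)             # (B, 2048)
```

Feeding the image representation only once, as the first decoder input, is one common design choice; other models inject it at every time step instead.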

B. Learning

The majority of encoder-decoder image captioning models use Maximum Likelihood Estimation (MLE) as their learning method. In the supervised learning setting, with training examples consisting of image-caption pairs, the model maximizes the probability of the correct caption given the image [17]:

\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta)    (1)

where I is the image, S = (S_1, \ldots, S_N) is the corresponding caption of length N, and \theta are the parameters of the model. The joint probability over words can be expressed as follows:

\log p(S \mid I) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1})    (2)

where the dependency on \theta is dropped for simplification.

To model p(S_t \mid I, S_1, \ldots, S_{t-1}), usually an LSTM is employed, which is trained to predict the next word S_t conditioned on all the previously predicted words S_1, \ldots, S_{t-1} and the context vector c produced by the encoder [18, 32]:

p(S_t \mid I, S_1, \ldots, S_{t-1}) = f(h_t, c)    (3)

where f is a nonlinear function that outputs the probability of S_t, and h_t is the hidden state of the LSTM at time step t.
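In practice, the per-word log-probabilities of equation (2) are usually optimized as a cross-entropy loss with teacher forcing. The following is a hedged sketch of one such training step; it assumes the illustrative decoder interface from the sketch above and is not the implementation of any cited model.

```python
import torch
import torch.nn.functional as F

def mle_step(decoder, optimizer, image_features, captions):
    """One teacher-forced MLE step: minimize -log p(S | I) from equation (2)."""
    # Feed I, S_1..S_{N-1}; the decoder's t-th output scores the next word S_t.
    scores = decoder(image_features, captions[:, :-1])        # (B, N, vocab_size)
    loss = F.cross_entropy(scores.reshape(-1, scores.size(-1)),
                           captions.reshape(-1))               # gold words S_1..S_N
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```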
Novel sentences can be generated by randomly sampling from the model's distribution or by using beam search [19, 17].
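A toy decoding loop under the same assumed decoder interface is sketched below; greedy selection and sampling are shown, while beam search would instead keep the k highest-scoring partial captions at each step. The start_id and end_id arguments stand for hypothetical start- and end-of-sentence token ids.

```python
import torch

@torch.no_grad()
def decode(decoder, image_feature, start_id, end_id, max_len=20, sample=False):
    """Generate a caption word by word (greedy by default, or by sampling)."""
    words = [start_id]
    for _ in range(max_len):
        scores = decoder(image_feature.unsqueeze(0),
                         torch.tensor([words]))[:, -1]         # scores for the next word
        probs = torch.softmax(scores, dim=-1)
        next_id = (torch.multinomial(probs, 1).item() if sample
                   else probs.argmax(dim=-1).item())
        if next_id == end_id:
            break
        words.append(next_id)
    return words[1:]                                           # strip the start token
```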
Although effective, some limitations of MLE learning have motivated the adoption of alternative learning methods. Reinforcement learning [20] can be used to address the exposure bias problem [21] and for the direct optimization of the standard evaluation metrics [22]. For increasing the diversity of generated captions, a conditional GAN framework [23] or contrastive learning [24] were proposed.
C. Attention Mechanism

It was demonstrated in [18] that the fixed-length vector representation produced by the encoder is responsible for the degradation of performance that occurs as the length of the input increases. Regardless of the size of the input, in the basic encoder-decoder all the information is compressed into a context vector of a predefined size. Instead, the authors proposed to encode the input into a set of vectors.

The first work to employ an attention mechanism on the task of image captioning was [25]. In the proposed model, image features are extracted from a lower convolutional layer of a CNN as a set of L annotation vectors a_1, \ldots, a_L, each summarizing a pre-defined spatial location of the image. To each annotation vector a_i, a positive weight \alpha_{t,i} is assigned, indicating the amount of attention each image feature receives. The attention weight is computed by an attention model f_{att}:

e_{t,i} = f_{att}(a_i, h_{t-1})    (4)

\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{L} \exp(e_{t,k})}    (5)

where h_{t-1} is the previous hidden state.

After obtaining the attention weights, the attention mechanism computes the context vector \hat{z}_t as a dynamic representation of the relevant parts of the image at a given time step:

\hat{z}_t = \sum_{i=1}^{L} \alpha_{t,i} a_i    (6)

The context vector is then used to update the hidden state of the decoder.
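As a small illustration, equations (4)-(6) might be written as the following soft-attention module; the layer names, the additive form of f_att and the dimensions are assumptions made for the sketch, not the exact implementation of [25].

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Soft attention over L annotation vectors, in the spirit of equations (4)-(6)."""
    def __init__(self, feat_dim, hidden_dim, att_dim=512):
        super().__init__()
        self.proj_a = nn.Linear(feat_dim, att_dim)    # projects annotation vectors a_i
        self.proj_h = nn.Linear(hidden_dim, att_dim)  # projects previous hidden state h_{t-1}
        self.score = nn.Linear(att_dim, 1)            # e_{t,i} = f_att(a_i, h_{t-1}), eq. (4)

    def forward(self, a, h_prev):
        # a: (B, L, feat_dim), h_prev: (B, hidden_dim)
        e = self.score(torch.tanh(self.proj_a(a) + self.proj_h(h_prev).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)               # eq. (5): weights over the L regions
        z = (alpha * a).sum(dim=1)                    # eq. (6): weighted sum of annotations
        return z, alpha.squeeze(-1)
```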
III. IMAGE CAPTIONING TYPES

We have grouped the methods and models of image captioning, according to the task, into three types: (1) standard image captioning, (2) image captioning with style and (3) cross-lingual and multilingual image captioning.

A. Standard Image Captioning

For the correct description of the image, it is necessary to: (1) detect the content of the image in terms of objects, attributes and relationships, with a conclusion about what is new or interesting [26, 27], and (2) express the represented semantic content with properly formulated sentences [17] that are suitable for the image they describe [10].

The captions generated by most of the contemporary methods usually represent an objective and neutral description of the factual content of the scene.
An example of a model designed to generate new captions with previously unseen combinations of objects is reported in [11]. The authors proposed a multimodal Recurrent Neural Network (m-RNN) framework that is adapted to both the retrieval and the sentence generation task. The model consists of a CNN and an RNN, which interact with each other in a multimodal layer receiving three inputs: the word embedding layer, the recurrent layer and an image representation. A final softmax layer generates the probability distribution of the next word.

In [17] the authors introduce an end-to-end trainable Neural Image Caption (NIC) system, similar to [11] but with an LSTM variant of the RNN as the decoder. The authors propose to use the maximum likelihood estimation (MLE) principle for training the model. Owing to its effectiveness, NIC became one of the most influential models, and other authors developed extensions on top of it [28, 29].

A similar end-to-end Long-Term Recurrent Convolutional Network (LRCN), combining a CNN encoder and an LSTM decoder, is introduced in [19]. The authors investigated the effects of different architectures and found that using an LSTM instead of a simple RNN, combined with a more powerful CNN, contributed to better performance. Adding more LSTM layers did not bring the expected improvements.

Different from the spatial attention model introduced in [25], the authors in [30] proposed a semantic attention model which combines different sources of visual information through a feedback process to attend to fine details in images while having an end-to-end trainable system. Top-down features, extracted from the last convolutional layer of a CNN, serve as a guide where to attend. A set of bottom-up attributes is detected as candidates for attention. Those with the highest attention scores are then used by the attention mechanism, which learns to attend to the semantically important concepts. Since irrelevant attributes may redirect attention to wrong concepts, attribute prediction plays a crucial role.

A similar approach is presented in [31], where the authors combine top-down and bottom-up attention processing to calculate attention at the object level. Instead of treating detected objects as a bag of words that does not retain spatial information, they propose a different, feature-based approach. A bottom-up attention mechanism, based on Faster R-CNN [2], proposes a set of salient image regions. Combined with the more traditional top-down approach, this allows the model to better reveal the structure of the scene and to better interpret the relationships between objects, which becomes important in dealing with compositionally new images.

The previously described attention models have a limitation in that they cannot selectively decide when to focus attention on the image. In [32] the authors argue that directing attention to the image at every time step becomes unnecessary for words that do not have a corresponding visual signal, such as "a", "for" etc. They introduce an adaptive attention model that automatically decides whether to focus attention on the image or to use information stored in the decoder's memory. An LSTM extension, called the sentinel gate, produces an additional visual sentinel vector, which is used when the model decides not to attend to the image. The new context vector is modeled as a combination of the context vector of the spatial attention model and the visual sentinel vector. It was shown that the ability to decide when to attend to the image was also useful for better directing attention to the appropriate image regions, which allowed the model to achieve state-of-the-art results.
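In the spirit of [32], and as an assumption about its exact form rather than a reproduction of it, that combination can be written as a gated mixture

\hat{c}_t = \beta_t s_t + (1 - \beta_t) c_t

where c_t is the spatial attention context vector, s_t is the visual sentinel and \beta_t \in [0, 1] is the sentinel gate, with values close to 1 meaning that the model relies on the decoder's memory rather than on the image.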
B. Image Captioning with Style

There are two lines of work focused on enriching captions with more emotional content. The first group of authors includes the viewer's attitude and emotions towards the image [29, 33, 34, 35], while the second line of work includes emotional content from the image itself [36].

The authors in [29] were the first to incorporate positive and negative sentiments into captions. They proposed SentiCap, a switching RNN model with word-level regularization which emphasizes sentiments. Two networks, each consisting of a CNN and an RNN, were used to generate stylized captions. One network was trained on a large image-caption dataset to generate standard factual descriptions, and the other was trained on a small dataset with sentiment polarity. Experiments showed that SentiCap was able to include the appropriate sentiment in 74% of the generated sentences.

The SentiCap model is limited in its ability to scale because it requires words labeled with sentiment strength. To address this issue, the authors in [33] propose StyleNet, an end-to-end trainable model which generates captions in a humorous or romantic style (Figure 2). A factored LSTM is used to factorize the weight matrices to account for the factual and non-factual aspects of the sentences. Multi-task learning is used to optimize the generation of factual captions and stylized language modeling. Almost 85% of the human evaluators found the stylized captions to be more attractive than the corresponding factual descriptions.

In [34] the authors propose two mechanisms to inject sentiment into captions: (1) direct injection, in which sentiment is injected as an additional dimension at each time step, and (2) injection by sentiment flow, in which sentiment is provided only at the first time step and then propagated over the whole sentence by a sentiment cell. Experiments showed that both methods were able to add sentiments, but direct injection generated more captions with sentiments.

Figure 2. Comparison of factual and stylized captions [35]

The Face-Cap [36] model presents a different point of view and embraces the emotions detected in the facial expressions of people depicted in the images. It relies on the use of a facial expression recognition model to extract facial features, which can then be used by an LSTM to generate captions. The authors observed that the model improved in describing the actions in the scene.
C. Cross-Lingual and Multilingual Image Captioning

Cross-lingual and multilingual image captioning refers to the task of generating a caption in one language given a corpus of descriptions in one or more different languages [37].

Several approaches exist to tackle such tasks, such as direct translation of generated captions, collecting a new dataset in a target language and using it for training, or learning a model from machine-translated texts. The first approach can give inferior results, among other reasons because direct translations can worsen the errors in the generated descriptions [28]. Therefore, researchers are primarily focused on developing models that will be able to cope with different languages directly.

The authors in [38] treat the problem as a visually-grounded machine translation task in which the image is used to resolve ambiguity. In the proposed multilingual image description model, visual features are complemented with textual features of the source language (English) to generate captions in German.

The authors in [37] transfer the knowledge obtained from learning on English captions to generate captions in Japanese as the target language. The model was first pretrained on the large English dataset. Then, the trained LSTM was replaced with a new one, trained on the much smaller corpus of Japanese captions. The authors noticed that pretraining had the effect of learning on an additional 10,000 images with Japanese captions. Moreover, the use of captions that are not direct translations made the model easier to scale.

The authors in [39] propose a single model capable of generating captions in multiple languages, but with the strong assumption that images with captions in different languages are readily available. A token, provided as the initial input, controls the choice of the target language.

The previous approaches rely on datasets with human-written captions in different languages. In [28] the authors adopt a cheaper solution by using machine-translated text. To overcome the lack of fluency of such translations, they introduce a fluency estimation module to assign an importance score to the captions that are then chosen for training. Experiments were performed on English-Chinese datasets and showed that the model, trained on a smaller dataset from which less fluent sentences were excluded, achieved results comparable to the baseline trained on all the machine-generated sentences (fluent or not).

IV. DATASETS

The development of this research area greatly depends on the availability of large datasets that contain images with corresponding descriptions. In addition to the size of the dataset, an image captioning model also benefits significantly from the quality of the captions in the spirit of natural language and their adaptation to a given task.

A. Collecting datasets

Images are collected primarily from photo-sharing services, mostly Flickr (https://www.flickr.com), or by harvesting the web. Unlike image gathering, obtaining appropriate descriptions turned out to be much more challenging.

As [10] pointed out, captions provided by users of photo-sharing websites are not suitable for the training of automatic image captioning systems. Such captions usually provide broader context, i.e., additional information that cannot be obtained from the image alone. Instead of using non-visual descriptions, [10] suggested focusing on general conceptual descriptions, i.e., those that refer to objects, attributes, events and other literal content of the image. Such descriptions are collected, on a large scale, through crowdsourcing services such as Amazon Mechanical Turk (AMT) [40, 10, 41], which involves defining a task that is performed by untrained workers [42]. Due to the low cost and high speed, crowdsourcing became the preferred way of collecting image descriptions for large-scale datasets.

B. Datasets

UIUC PASCAL Sentences [40] was one of the first image-caption datasets, consisting of 1,000 images, each associated with five different descriptions collected via crowdsourcing. It was used by early image captioning systems [8], but due to its limited domain, small size and relatively simple captions it is rarely used.

Flickr 30K [43] includes and extends the previous Flickr 8K [10, 40] dataset. It consists of 31,783 images showing everyday activities, events and scenes, described by 158,915 captions obtained via crowdsourcing.

The Microsoft COCO Captions [41] dataset contains more complex images of everyday objects and scenes. By adding human-generated captions, two datasets were created: c5, with five captions for each of the more than 300K images, and an additional c40 dataset with 40 different captions for 5K randomly chosen images. The c40 set was created because it was observed [44] that some evaluation metrics benefit from more reference captions.

Flickr 30K and MS COCO Captions are widely accepted as benchmark datasets for image captioning by most models using deep neural networks.
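As an illustration of how such caption annotations are typically consumed, the sketch below groups COCO-style reference captions by image id. It relies only on the publicly documented annotation layout (a list of annotations, each with an image_id and a caption); the file name is a placeholder, and nothing here is prescribed by the paper.

```python
import json
from collections import defaultdict

# Placeholder path; COCO ships files such as captions_train2014.json.
with open("captions_train.json") as f:
    data = json.load(f)

captions_per_image = defaultdict(list)
for ann in data["annotations"]:          # each entry holds one image_id / caption pair
    captions_per_image[ann["image_id"]].append(ann["caption"])

print(len(captions_per_image), "images,",
      sum(len(c) for c in captions_per_image.values()), "reference captions")
```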
V. EVALUATION

Assessing the accuracy of automatically generated image captions is a demanding task [44, 45]. The same image can be described in various ways, focusing on different parts of the image, using a different level of abstraction or a different level of knowledge about the objects in the scene. It is obvious that by emphasizing different aspects of the image, the resulting sentences can vary significantly while at the same time being entirely correct. Conversely, two captions can share most of the words and yet convey a different meaning.

Evaluation of automatically generated captions can be performed by human subjects, either by experts [10] or by untrained workers through crowdsourcing platforms [19, 22]. However, human-based evaluation creates additional costs; it is slow, subjective and difficult to reproduce [10, 46]. A better alternative is the use of automatic metrics, which, in turn, are fast, accurate and inexpensive [45].
To be useful, metrics should match the ratings of human evaluators, but this turned out to be a goal that is difficult to achieve. Evaluation metrics should satisfy two criteria [22]: (1) captions that are considered good by humans should achieve high scores, and (2) captions that achieve high scores should be considered good by humans.

Image captioning is sometimes compared [47] to language translation [8, 17] or to text summarization [48], which motivated the adaptation of metrics initially developed for the evaluation of language tasks [49, 50, 51]. All these metrics output a score indicating the similarity between the candidate sentence and the reference sentences.

BLEU [49] is a popular metric for machine translation evaluation and one of the first metrics used to evaluate image descriptions. It computes the geometric mean of n-gram precision scores multiplied by a brevity penalty in order to avoid overly short sentences.
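To make those two ingredients concrete, the following self-contained sketch computes clipped n-gram precisions and a brevity penalty for a single reference sentence. It is a simplified illustration, not the official BLEU implementation of [49] nor the version used by the COCO evaluation server.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions multiplied by a brevity penalty (single reference only)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(max(clipped, 1e-9) / max(sum(cand.values()), 1))  # smoothed
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))       # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("a dog runs on the grass".split(),
           "a dog is running on the grass".split()))
```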
METEOR [50] is another machine translation metric. It relies on the use of stemmers, WordNet [52] synonyms and paraphrase tables to identify matches between the candidate sentence and the reference sentences.

ROUGE [51] is a package of measures initially developed for the evaluation of text summaries. For image captioning, the ROUGE-L variant is usually used, which computes an F-measure based on the Longest Common Subsequence (LCS), i.e. a set of words shared by two sentences which occur in the same order, without requiring consecutive matches.
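A minimal sketch of that computation is given below, assuming a single reference and an F-measure weighting commonly used for captioning; it is illustrative and not the official ROUGE package.

```python
def lcs_length(x, y):
    """Length of the Longest Common Subsequence of two token lists."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if xi == yj
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.2):
    """F-measure over LCS-based precision and recall (single-reference sketch)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(candidate), lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```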
CIDEr [44] is a metric designed for the evaluation of automatically generated image captions. It measures the similarity between the candidate sentence and a set of human-written sentences by performing Term Frequency-Inverse Document Frequency (TF-IDF) weighting for each n-gram.

SPICE [45] is a metric designed for image caption evaluation. It measures the quality of generated captions by computing an F-measure based on the propositional semantic content of the candidate and reference sentences represented as scene graphs [53].

The metrics above represent the standard set of metrics usually reported in papers. Their popularity can be attributed to their availability through the Microsoft COCO caption evaluation server [41], which enables a consistent comparison of different models.

However, it was shown [47, 10] that automatic metrics do not always correlate with human judgments. This was particularly evident during the Microsoft COCO 2015 Captioning Challenge, in which some models outperformed the human upper bound according to automatic metrics, but human judges demonstrated a preference for human-written captions [54]. It seems that “humans do not always like what is human-like” [44]. Since there is no best metric, some authors [45, 46] advise the use of an ensemble of metrics capturing various dimensions, such as grammaticality, saliency, correctness or truthfulness. In [22, 46] new evaluation metrics were proposed.

VI. CONCLUSION

This paper presents an overview of recent advances in image captioning research, with a particular focus on models employing deep encoder-decoder architectures. The main advantage of such architectures is that they are trainable end-to-end, mapping directly from images to sentences.

An important extension of the basic encoder-decoder framework is the attention mechanism, which enables the model to focus on the most salient parts of the input image while generating the next word of the output. We group the related work into three types regarding the task: standard image captioning (with or without an attention mechanism), image captioning with style, and cross-lingual or multilingual image captioning.

Large vision and language datasets have also contributed significantly to the development of the field. Additional features of the new datasets, such as emotions or descriptions in different languages, will certainly stimulate even faster advances in the periods to come.

However, there are some important tasks that are still unresolved, such as generating captions more in the spirit of human descriptions, automatic adaptation of descriptions to the given task, and, perhaps the most challenging among them, automatic assessment of the generated captions, since there are still no metrics that fully match human evaluation.

ACKNOWLEDGMENT

This research was supported by the Croatian Science Foundation under the project IP-2016-06-8345 “Automatic recognition of actions and activities in multimedia content from the sports domain” (RAASS) and by the University of Rijeka under the project number 18-222-1385.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS’12, 2012, vol. 1, pp. 1097–1105.
[2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” arXiv preprint arXiv:1506.01497, 2015.
[3] R. Krishna et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.
[4] R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in ICML-14, 2014, pp. 595–603.
[5] M. Ivašić-Kos, I. Ipšić, and S. Ribarić, “A knowledge-based multi-layered image annotation system,” Expert Systems with Applications, vol. 42, pp. 9539–9553, 2015.
[6] M. Ivašić-Kos, M. Pavlić, and M. Pobar, “Analyzing the semantic level of outdoor image annotation,” in MIPRO 2009, IEEE, Opatija, 2009, pp. 293–296.
[7] M. Pobar and M. Ivašić-Kos, “Multimodal image retrieval based on keywords and low-level image features,” in Semantic Keyword-based Search on Structured Data, IKC 2015, Coimbra, Portugal, Springer, 2015, pp. 133–140.
[8] G. Kulkarni et al., “Babytalk: Understanding and generating simple image descriptions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
[9] M. Ivašić-Kos, M. Pobar, and S. Ribarić, “Two-tier image annotation model based on a multi-label classifier and fuzzy-knowledge representation scheme,” Pattern Recognition, vol. 52, pp. 287–305, 2016.
representation scheme, Pattern recognition. 52 (2016); 287-305
[10] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image [35] T. Chen et al., “Factual or Emotional: Stylized Image Captioning
description as a ranking task: Data, models and evaluation with Adaptive Learning and Attention,” in ECCV, 2018, pp. 519–
metrics,” JAIR, vol. 47, pp. 853–899, 2013. 535.
[11] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep [36] O. M. Nezami, M. Dras, P. Anderson, and L. Hamey, “Face-Cap:
captioning with multimodal recurrent neural networks (m-rnn),” Image Captioning Using Facial Expression Analysis,” in Joint
arXiv preprint arXiv:1412.6632, 2014. European Conference on Machine Learning and Knowledge
[12] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence Discovery in Databases, 2018, pp. 226–240.
learning with neural networks,” in NIPS’14, 2014, vol. 2, pp. 3104– [37] T. Miyazaki and N. Shimizu, “Cross-lingual image caption
3112. generation,” in Proceedings of the 54th Annual Meeting of the
[13] Y. LeCun and Y. Bengio, “Convolutional networks for images, Association for Computational Linguistics (Volume 1: Long
speech, and time series,” The Handbook of Brain Theory and Papers), 2016, vol. 1, pp. 1780–1790.
Neural Networks, vol. 3361, no. 10, 1995. [38] D. Elliott, S. Frank, and E. Hasler, “Multilingual image description
[14] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, with neural sequence models,” arXiv preprint arXiv:1510.04709,
no. 2, pp. 179–211, 1990. 2015.
[15] O. Russakovsky et al., “Imagenet large scale visual recognition [39] S. Tsutsui and D. Crandall, “Using artificial tokens to control
challenge,” International Journal of Computer Vision, vol. 115, no. languages for multilingual image caption generation,” arXiv
3, pp. 211–252, 2015. preprint arXiv:1706.06275, 2017.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” [40] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier,
Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. “Collecting image annotations using Amazon’s Mechanical Turk,”
in Proceedings of the NAACL HLT 2010 Workshop on Creating
[17] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A Speech and Language Data with Amazon’s Mechanical Turk, 2010,
neural image caption generator,” in CVPR, 2015, pp. 3156–3164
pp. 139–147.
[18] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation [41] X. Chen et al., “Microsoft COCO captions: Data collection and
by jointly learning to align and translate,” arXiv preprint evaluation server,” arXiv preprint arXiv:1504.00325, 2015.
arXiv:1409.0473, 2014.
[42] R. Bernardi et al., “Automatic Description Generation from Images:
[19] J. Donahue et al., “Long-term recurrent convolutional networks for A Survey of Models, Datasets, and Evaluation Measures,” JAIR,
visual recognition and description,” in CVPR, 2015, pp. 2625– vol. 55, pp. 409–442, 2016.
2634.
[43] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image
[20] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self- descriptions to visual denotations: New similarity metrics for
critical sequence training for image captioning,” in CVPR, 2017, semantic inference over event descriptions,” Transactions of the
pp. 7008–7024. Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
[21] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level [44] R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider: Consensus-
training with recurrent neural networks,” arXiv preprint based image description evaluation,” in CVPR, 2015, pp. 4566–
arXiv:1511.06732, 2015. 4575.
[22] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved [45] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice:
Image Captioning via Policy Gradient Optimization of SPIDEr,” Semantic propositional image caption evaluation,” in ECCV 2016,
arXiv preprint arXiv:1612.00370, 2016. 2016, pp. 382–398.
[23] B. Dai, S. Fidler, R. Urtasun, and D. Lin, “Towards diverse and [46] M. Kilickaya, A. Erdem, N. Ikizler-Cinbis, and E. Erdem, “Re-
natural image descriptions via a conditional gan,” in ICCV, 2017, evaluating automatic metrics for image captioning,” arXiv preprint
pp. 2970–2979. arXiv:1612.07600, 2016.
[24] B. Dai and D. Lin, “Contrastive learning for image captioning,” in [47] D. Elliot and F. Keller, “Comparing automatic evaluation measures
Advances in Neural Information Processing Systems, 2017, pp. for image description,” in Proceedings of the 52nd Annual Meeting
898–907.
of the Association for Computational Linguistics: Short Papers,
[25] K. Xu et al., “Show, attend and tell: Neural image caption 2014, pp. 452–457.
generation with visual attention,” in ICML, 2015, pp. 2048–2057.
[48] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos, “Corpus-
[26] H. Fang et al., “From captions to visual concepts and back,” in guided sentence generation of natural images,” in Proceedings of
CVPR, 2015, pp. 1473–1482. the Conference on Empirical Methods in Natural Language
[27] M. Ivašić-Kos, M. Pobar, S. Ribarić, “Automatic image annotation Processing, 2011, pp. 444–454.
refinement using fuzzy inference algorithms”, IFSA- EUSFLAT [49] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, “BLEU: a method
2015, Gijón, Asturias, (Spain) p. 242 for automatic evaluation of machine translation,” in Proceedings of
[28] W. Lan, X. Li, and J. Dong, “Fluency-guided cross-lingual image the 40th annual meeting on association for computational
captioning,” in Proceedings of the 25th ACM international linguistics, 2002, pp. 311–318.
conference on Multimedia, 2017, pp. 1549–1557. [50] M. Denkowski and A. Lavie, “Meteor universal: Language-specific
[29] A. P. Mathews, L. Xie, and X. He, “Senticap: Generating image translation evaluation for any target language,” in 9th Workshop on
descriptions with sentiments,” in 13th AAAI Conference on Statistical Machine Translation, 2014, pp. 376–380.
Artificial Intelligence, 2016. [51] C. Y. Lin, “Rouge: A package for automatic evaluation of
[30] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning summaries,” in Text summarization branches out: Proceedings of
with semantic attention,” in CVPR, 2016, pp. 4651–4659. the ACL-04 workshop, 2004, vol. 8.
[31] P. Anderson et al., “Bottom-up and top-down attention for image [52] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller,
captioning and vqa,” arXiv preprint arXiv:1707.07998, 2017. “Introduction to WordNet: An on-line lexical database,”
[32] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: International Journal of Lexicography, vol. 3, no. 4, pp. 235–244,
Adaptive attention via A visual sentinel for image captioning,” 1990.
arXiv preprint arXiv:1612.01887, 2016. [53] J. Johnson et al., “Image retrieval using scene graphs,” in CVPR,
[33] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, “Stylenet: Generating 2015, pp. 3668–3678.
attractive visual captions with styles,” in CVPR, 2017, pp. 3137– [54] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell:
3146. Lessons learned from the 2015 MS COCO image captioning
[34] Q. You, H. Jin, and J. Luo, “Image captioning at will: A versatile challenge,” IEEE Transactions on Pattern Analysis and Machine
scheme for effectively injecting sentiments into image Intelligence, vol. 39, no. 4, pp. 652–663, 2017.
descriptions,” arXiv preprint arXiv:1801.10121, 2018.
