
An Empirical Study of Language CNN for Image Captioning

Jiuxiang Gu^1, Gang Wang^2, Jianfei Cai^3, Tsuhan Chen^3

^1 ROSE Lab, Interdisciplinary Graduate School, Nanyang Technological University, Singapore
^2 Alibaba AI Labs, Hangzhou, China
^3 School of Computer Science and Engineering, Nanyang Technological University, Singapore
{jgu004, asjfcai, tsuhan}@ntu.edu.sg, gangwang6@gmail.com

Abstract

Language models based on recurrent neural networks have dominated recent image caption generation tasks. In this paper, we introduce a language CNN model which is suitable for statistical language modeling tasks and shows competitive performance in image captioning. In contrast to previous models, which predict the next word based on one previous word and a hidden state, our language CNN is fed with all the previous words and can model the long-range dependencies in history words, which are critical for image captioning. The effectiveness of our approach is validated on two datasets: Flickr30K and MS COCO. Our extensive experimental results show that our method outperforms vanilla recurrent neural network based language models and is competitive with the state-of-the-art methods.

1. Introduction

Image caption generation is a fundamental problem that involves Computer Vision, Natural Language Processing (NLP), and Machine Learning. It can be seen as "translating" an image into proper sentences. While this task seems easy for human beings, it is quite challenging for machines because it requires the model to understand the image content and express the relationships among objects in natural language. The image captioning model should also be capable of capturing the implicit semantic information of an image and generating humanlike sentences. As a result, generating accurate captions for an image is not an easy task.

The recent surge of research interest in the image caption generation task is due to the advances in Neural Machine Translation (NMT) [44] and large datasets [39, 29]. Most image captioning models follow the encoder-decoder pipeline [4, 24, 35, 19, 41]. The encoder-decoder framework was recently introduced for sequence-to-sequence learning based on Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks. Both RNNs and LSTM networks can be sequence learners. However, due to the vanishing gradient problem, RNNs can only remember the previous status for a few time steps. The LSTM network is a special type of RNN architecture designed to solve the vanishing gradient problem in RNNs [46, 15, 6]. It introduces a new component called the memory cell. Each memory cell is composed of three gates and a neuron with a self-recurrent connection. These gates allow the memory cells to keep and access information over a long period of time and make the LSTM network capable of learning long-term dependencies.

Although models like LSTM networks have memory cells which aim to memorize history information for the long term, they are still limited to several time steps because long-term information is gradually diluted at every time step [49]. Besides, vanilla RNN-based image captioning models recursively accumulate history information without explicitly modeling the hierarchical structure of word sequences, which clearly have a bottom-up structure [28].

To better model the hierarchical structure and long-term dependencies in word sequences, in this paper we adopt a language CNN which applies temporal convolution to extract features from sequences. This approach is inspired by works in NLP which have shown that CNNs are very powerful for text representation [18, 48]. Unlike the vanilla CNN architecture, we drop the pooling operation to keep the relevant information for word representation and investigate the optimal convolutional filters by experiments. However, using the language CNN alone fails to model the dynamic temporal behavior. Hence, we still need to combine the language CNN with recurrent networks (e.g., RNN or LSTM). Our extensive studies show that adding a language CNN to a recurrent network helps model sequences consistently and more effectively, and leads to improved results.

To summarize, our primary contribution lies in incorporating a language CNN, which is capable of capturing long-range dependencies in sequences, with RNNs for image captioning. Our model yields comparable performance with the state-of-the-art approaches on Flickr30k [39] and MS COCO [29].
2. Related Works

The problem of generating natural language descriptions for images has become a hot topic in the computer vision community. Prior to the use of neural networks for generating descriptions, the classical approach was to pose the problem as a retrieval and ranking problem [12, 9, 37]. The main weakness of those retrieval-based approaches is that they cannot generate proper captions for a new combination of objects. Inspired by the success of deep neural networks in machine translation [44, 4, 17], researchers have proposed to use the encoder-decoder framework for image caption generation [21, 35, 19, 46, 6, 3, 26]. Instead of translating sentences between two languages, the goal of image captioning is to "translate" a query image into a sentence that describes the image. The earliest approach using a neural network for image captioning was proposed by Vinyals et al. [46]; it is an encoder-decoder system trained to maximize the log-likelihood of the target image descriptions. Similarly, Mao et al. [35] and Donahue et al. [6] use a multimodal fusion layer to fuse the image features and word representation at each time step. In both cases, i.e., the models in [35] and [6], the captions are generated from the full images, while the image captioning model proposed by Karpathy et al. [19] generates descriptions based on regions. This work was later followed by Johnson et al. [16], whose method is designed to jointly localize regions and describe each of them with captions.

Rather than representing an image as a single feature vector from the top layer of CNNs, some researchers have explored the structure of networks to explicitly or implicitly model the correlation between images and descriptions [51, 34, 30]. Xu et al. [51] incorporate spatial attention on convolutional features of an image into the encoder-decoder framework through the "hard" and "soft" attention mechanisms. Their work is followed by Yang et al. [52], whose method introduces a review network to improve the attention mechanism, and by Liu et al. [30], whose approach is designed to improve the correctness of visual attention. Moreover, a variational autoencoder for image captioning is developed by Pu et al. [40]. They use a CNN as the image encoder and a deep generative deconvolutional network as the decoder, together with a Gated Recurrent Unit (GRU) [4], to generate image descriptions.

More recently, high-level attributes have been shown to yield clear improvements on the image captioning task when injected into existing encoder-decoder based models [50, 53, 8]. Specifically, Jia et al. [15] use semantic information as an extra input to guide the model in generating captions. In addition, Fang et al. [7] first learn a visual attributes detector based on multi-instance learning (MIL) and then learn a statistical language model for caption generation. Likewise, Wu et al. [50] train several visual attribute classifiers and take the outputs of those classifiers as inputs for the LSTM network to predict words.

In general, current recurrent neural network based approaches have shown a powerful capability for modeling word sequences [46, 19]. However, the history-summarizing hidden states of RNNs are updated at each time step, which makes long-term memory rather difficult [25, 36]. Besides, we argue that current recurrent networks like LSTM are not efficient at modeling the hierarchical structure of word sequences. All of this prompts us to explore a new language model to extract better sentence representations. Considering that ConvNets can be stacked to extract hierarchical features over long-range contexts and have received a lot of attention in many tasks [10], in this paper we design a language CNN to model words with long-term dependencies through multilayer ConvNets and to model the hierarchical representation through a bottom-up, convolutional architecture.

3. Model Architecture

3.1. Overall Framework

We study the effect of the language CNN by combining it with recurrent networks. Figure 1 shows our recursive framework. It consists of one deep CNN for image encoding, one CNN for sentence modeling, and a recurrent network for sequence prediction. In order to distinguish these two CNN networks, we name the first CNN for image feature extraction CNN_I, and the second CNN for language modeling CNN_L.

Figure 1. An overview of our framework. The input of our model is a query image. Our model estimates the probability distribution of the next word given the previous words and the image. It consists of four parts: a CNN_I for image feature extraction, a deep CNN_L for language modeling, a multimodal layer (M) that connects the CNN_I and CNN_L, and a Recurrent Network (e.g., RNN, LSTM, etc.) for word prediction. The weights are shared among all time frames.

Given an image I, we take the widely-used CNN architecture VGGNet (16-layer) [42] pre-trained on ImageNet [22] to extract the image features V ∈ R^K. The CNN_L is designed to represent words and their hierarchical structure in word sequences. It takes a sequence of t generated words (each word encoded as a one-hot representation) as input and generates a bottom-up representation of these words. The outputs of CNN_I and CNN_L are fed into a multimodal fusion layer, and the recurrent network f_recurrent(·) is used to predict the next word. The following equations show the main working flow of our model:

    V = CNN_I(I)                                              (1)
    y^[t] = CNN_L(S^[0], S^[1], ..., S^[t-1])                 (2)
    m^[t] = f_multimodal(y^[t], V)                            (3)
    r^[t] = f_recurrent(r^[t-1], x^[t-1], m^[t])              (4)
    S^[t] ~ argmax_S Softmax(W_o r^[t] + b_o)                 (5)

where t ∈ [0, N-1] is the time step, y^[t] is the output vector of CNN_L, r^[t] is the activation output of the recurrent network, and S^[t] is the t-th word drawn from the dictionary S according to the maximum Softmax probability controlled by r^[t]; W_o and b_o are the weights and biases used for calculating the distribution over words. Equations 2, 3, 4 and 5 are applied recursively; the design of each function is discussed below.
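As a concrete illustration of Equations (1)-(5), the following PyTorch-style sketch shows one way the recursive decoding loop could be wired together. The module names (cnn_i, cnn_l, fuse, rnn_cell) and the tiny stand-in layers are illustrative assumptions for readability, not the authors' released implementation; the real CNN_L is described in Section 3.2.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the four components; all names and sizes are assumptions.
K, vocab_size, hidden = 512, 9568, 512

cnn_i    = nn.Linear(4096, K)             # stands in for the VGG-16 image encoder (Eq. 1)
embed    = nn.Embedding(vocab_size, K)    # word embedding matrix W_e
cnn_l    = nn.Linear(K, K)                # crude stand-in for CNN_L (Eq. 2); see Sec. 3.2
fuse     = nn.Linear(2 * K, K)            # multimodal fusion f_multimodal (Eq. 3)
rnn_cell = nn.RNNCell(2 * K, hidden)      # recurrent transition f_recurrent (Eq. 4)
out      = nn.Linear(hidden, vocab_size)  # W_o, b_o (Eq. 5)

def greedy_decode(image_feat, start_id=0, max_len=16):
    V = cnn_i(image_feat)                             # Eq. (1)
    words, r = [start_id], torch.zeros(1, hidden)
    for _ in range(1, max_len):
        hist = embed(torch.tensor([words]))           # all previous words, shape (1, t, K)
        y = cnn_l(hist.mean(dim=1))                   # Eq. (2): history summary y^[t]
        m = torch.tanh(fuse(torch.cat([y, V], 1)))    # Eq. (3)
        x_prev = embed(torch.tensor([words[-1]]))     # previous word embedding x^[t-1]
        r = rnn_cell(torch.cat([m, x_prev], 1), r)    # Eq. (4)
        words.append(int(out(r).argmax(dim=1)))       # Eq. (5), greedy argmax
    return words

caption_ids = greedy_decode(torch.randn(1, 4096))
```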

3.2. CNN_L Layer

Models based on RNNs have dominated recent sequence modeling tasks [23, 31, 32, 44], and most recent image captioning models are based on LSTM networks [6, 19, 34]. However, LSTM networks cannot explicitly model the hierarchical representation of words. Even with multi-layer LSTM networks, such hierarchical structure is still hard to capture due to the more complex model and the higher risk of over-fitting.

Inspired by the recent success of CNNs in computer vision [10, 14], we adopt a language CNN with a hierarchical structure, called CNN_L, to capture the long-range dependencies between the input words. The first layer of CNN_L is a word embedding layer. It embeds the one-hot word encoding from the dictionary into a word representation through a lookup table. Suppose we have t input words S = {S^[0], S^[1], ..., S^[t-1]}, where S^[i] is the one-of-V (one-hot) encoding, with V the size of the vocabulary. We first map each word S^[t] in the sentence into a K-dimensional vector x^[t] = W_e S^[t], where W_e ∈ R^{K×V} is a word embedding matrix (to be learned). Next, those embeddings are concatenated to produce a matrix as follows:

    x = [x^[0], x^[1], ..., x^[t-1]]^T,   x ∈ R^{t×K}         (6)

The concatenated matrix x is fed to the convolutional layers. Just like a normal CNN, CNN_L has a fixed architecture with a predefined maximum number of input words (denoted L_L). Unlike the toy example in Figure 2, in practice we use a larger and deeper CNN_L with L_L = 16.

We use temporal convolution [21] to model the sentence. Given an input feature map y^(ℓ-1) ∈ R^{M_{ℓ-1}×K} of Layer-(ℓ-1), the output feature map y^(ℓ) ∈ R^{M_ℓ×K} of the temporal convolution Layer-ℓ is:

    y_i^(ℓ)(x) = σ(w_L^(ℓ) y_i^(ℓ-1) + b_L^(ℓ))               (7)

where y_i^(ℓ)(x) gives the output of the feature map at location i in Layer-ℓ, w_L^(ℓ) denotes the parameters of Layer-ℓ, and σ(·) is the activation function, e.g., Sigmoid or ReLU. The input feature map y_i^(ℓ-1) is the segment of Layer-(ℓ-1) used for the convolution at location i, while y^(0) is the concatenation of the t word embeddings from the sequence input S^[0:t-1]. The definition of y^(0) is as follows:

    y^(0) := [x^[t-L_L], ..., x^[t-1]]^T                            if t ≥ L_L,
             [x^[0], ..., x^[t-1], x̃^[t], ..., x̃^[L_L-1]]^T        otherwise.       (8)

Specifically, when t ≥ L_L the input sentence is truncated: we only use the L_L words before the current time step t. When t < L_L, the input sentence is padded with x̃^[:]. Note that if t = 0, x̃^[:] are the image features V; otherwise x̃^[:] are zero vectors with the same dimension as x^[:].

Previous CNNs, including those adopted for NLP tasks [13, 18], take the classic convolution-pooling strategy, which uses max-pooling to pick the highest-response feature across time. This strategy works well for tasks like text classification [18] and matching [13], but is undesirable for modeling compositionality, because it ignores the temporal information in the sequence. In our network, we discard the pooling operations. We consider words as the smallest linguistic unit and apply a straightforward stack of convolution layers on top of each other. In practice, we find that a deeper CNN_L works better than a shallow one, which is consistent with the tradition of CNNs in computer vision [10], where using very deep CNNs is key to better feature representation.

The output features of the final convolution layer are fed into a fully connected layer that projects the extracted word features into a low-dimensional representation. Next, the projected features are fed to a highway connection [43], which controls the flow of information in the layer and improves the gradient flow. The final output of the highway connection is a K-dimensional vector y^[t].

Figure 2. The architecture of the language CNN for sentence modeling. Here "/" stands for zero padding. The CNN_L builds a hierarchical representation of the history words which contains the useful information for next-word prediction.
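To make the stacked temporal convolutions, the fully connected projection, and the highway connection more concrete, here is a minimal PyTorch sketch of a CNN_L-style module. The kernel sizes (5, 5, 3, 3, 3) follow the caption of Table 1, but the zero-only padding, ReLU activations, and other hyperparameters are assumptions of this sketch rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageCNN(nn.Module):
    """Sketch of CNN_L: embedding -> stacked temporal convolutions (no pooling)
    -> fully connected projection -> highway connection (Eqs. 6-8)."""
    def __init__(self, vocab_size, K=512, L=16, kernels=(5, 5, 3, 3, 3)):
        super().__init__()
        self.L = L
        self.embed = nn.Embedding(vocab_size, K)                # word embedding W_e
        self.convs = nn.ModuleList(
            [nn.Conv1d(K, K, k, padding=k // 2) for k in kernels])  # temporal convolutions
        self.fc = nn.Linear(L * K, K)                           # fully connected projection
        self.transform = nn.Linear(K, K)                        # highway transform gate
        self.carry_input = nn.Linear(K, K)                      # modulated input of the highway

    def forward(self, word_ids):
        # word_ids: (batch, t) indices of the history words.
        x = self.embed(word_ids[:, -self.L:])                   # keep at most L_L words (Eq. 8)
        pad = self.L - x.size(1)
        if pad > 0:                                             # zero-pad short histories (Eq. 8)
            x = F.pad(x, (0, 0, 0, pad))
        y = x.transpose(1, 2)                                   # (batch, K, L) for Conv1d
        for conv in self.convs:                                 # stack of convolutions, no pooling
            y = F.relu(conv(y))
        proj = self.fc(y.flatten(1))                            # low-dimensional projection
        t_gate = torch.sigmoid(self.transform(proj))            # highway: t*h + (1-t)*proj
        h = F.relu(self.carry_input(proj))
        return t_gate * h + (1.0 - t_gate) * proj               # K-dimensional y^[t]

# Usage: LanguageCNN(vocab_size=9568)(torch.randint(0, 9568, (2, 7))).shape -> (2, 512)
```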
3.3. Multimodal Fusion Layer

Next, we add a multimodal fusion layer after CNN_L, which fuses the word representation and the image features. This layer has two inputs: the bottom-up word representation y^[t] extracted from CNN_L and the image representation V extracted from CNN_I. We map these two inputs to the same multimodal feature space and combine them to obtain the activation of the multimodal features:

    m^[t] = f_multimodal(y^[t], V)                             (9)
          = σ(f_y(y^[t]; W_Y, b_Y) + g_v(V; W_V, b_V))        (10)

where "+" denotes element-wise addition, f_y(·) and g_v(·) are linear mapping functions, and m^[t] is the output feature vector of the multimodal layer. σ(·) is the activation function; here we use the scaled tanh function [27], which leads to faster training than the basic tanh function.

3.4. Recurrent Networks

Our CNN_L may miss important temporal information because it extracts holistic features from the whole sequence of words. To overcome this limitation, we combine it with recurrent networks. In our model, the transition equations of the recurrent network can be formulated as:

    r^[t] = f_recurrent(r^[t-1], x^[t-1], m^[t])              (11)
    S^[t] ~ argmax_S Softmax(W_o r^[t] + b_o)                 (12)

where r^[t] denotes the recurrent state, x^[t-1] = W_e S^[t-1] is the previous word embedding, m^[t] is the multimodal fusion output, and f_recurrent(·) is the transition function of the recurrent network. Softmax(r^[t]) is the probability of word S^[t] given by the Softmax layer, and S^[t] is the t-th decoded word.

In our study, we combine our language CNN with four types of recurrent networks: Simple RNN, LSTM network, GRU [4], and Recurrent Highway Network (RHN) [54].

Traditionally, the simple RNN updates the recurrent state r^[t] of Equation 11 as follows:

    r^[t] = tanh(W_r r^[t-1] + W_z z^[t] + b)                 (13)

where z^[t] is the input. However, this type of simple RNN struggles to deal with long-term dependencies [2]: the vanishing gradient makes gradients in directions corresponding to short-term dependencies large, while the gradients in directions corresponding to long-term dependencies are small.

The LSTM network extends the simple RNN with a gating mechanism (input gate, forget gate, and output gate) to control the information flow, and a memory cell to store the history information; thus it can model long-term dependencies better than the simple RNN.

GRU is an architecture similar to the LSTM, but with a simplified structure. GRU does not have a separate memory cell and exposes its hidden state r^[t] without any control. Thus, it is computationally more efficient and outperforms the LSTM network on many tasks due to its simple structure.

Besides, we also consider a fourth type of recurrent network: RHN, which introduces the highway connection into the simple RNN. RHN has directly gated connections between the previous state r^[t-1] and the current input z^[t] to modulate the flow of information. The transition equations of RHN can be formulated as follows:

    [t^[t]; c^[t]; h^[t]] = [σ; σ; tanh](M [r^[t-1]; z^[t]])  (14)
    r^[t] = h^[t] ⊙ t^[t] + c^[t] ⊙ r^[t-1]                   (15)

where c^[t] is the carry gate, t^[t] is the transform gate, h^[t] denotes the modulated input, and M : R^{2K+d} → R^{3d} is an affine transformation. z^[t] ∈ R^{2K} denotes the concatenation of two vectors: m^[t] and x^[t-1]. According to Equations 3 and 2, z^[t] can be expressed as follows:

    z^[t] = [f_multimodal(CNN_L(x^[0,...,t-1]), V); x^[t-1]]  (16)

Like GRU, RHN does not have an output gate to control the exposure of the recurrent state r^[t], but exposes the whole state each time. The RHN, however, does not have a reset gate to drop information that is irrelevant in the future. As our CNN_L can extract the relevant information from the sequence of history words at each time step, to some extent the CNN_L allows the model to add information that is useful in making a prediction.
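The RHN transition in Equations (14)-(16) can be sketched as a single affine map whose output is split into the transform gate, carry gate, and modulated input. The following PyTorch cell is a minimal illustration under assumed dimensions (K for the fused and word vectors, d for the recurrent state); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class RHNCell(nn.Module):
    """Minimal RHN step (Eqs. 14-15): one affine map M produces the
    transform gate t, the carry gate c, and the modulated input h."""
    def __init__(self, K=512, d=512):
        super().__init__()
        # M : R^(2K + d) -> R^(3d), applied to the concatenation [r^[t-1]; z^[t]]
        self.M = nn.Linear(2 * K + d, 3 * d)
        self.d = d

    def forward(self, r_prev, m_t, x_prev):
        z_t = torch.cat([m_t, x_prev], dim=1)          # Eq. (16): z^[t] = [m^[t]; x^[t-1]]
        gates = self.M(torch.cat([r_prev, z_t], dim=1))
        t_g, c_g, h = gates.split(self.d, dim=1)       # split into the three branches
        t_g, c_g, h = torch.sigmoid(t_g), torch.sigmoid(c_g), torch.tanh(h)
        return h * t_g + c_g * r_prev                  # Eq. (15)

# Usage with assumed sizes: r has dimension d; m^[t] and x^[t-1] each have dimension K.
cell = RHNCell(K=512, d=512)
r = cell(torch.zeros(1, 512), torch.randn(1, 512), torch.randn(1, 512))
```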

3.5. Training

During training, given the ground-truth words S and the corresponding image I, the loss function for a single training instance (S, I) is defined as the sum of the negative log-likelihoods of the words:

    L(S, I) = - Σ_{t=0}^{N-1} log P(S^[t] | S^[0], ..., S^[t-1], I)     (17)

where N is the sequence length and S^[t] denotes a word in the sentence S.

The training objective is to minimize this cost function, which is equivalent to maximizing the probability of the ground-truth context words given the image: argmax_θ Σ_{t=0}^{N-1} log P(S^[t] | S^[0:t-1], I), where θ are the parameters of our model and P(S^[t] | S^[0:t-1], I) corresponds to the activation of the Softmax layer.
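Equation (17) is the standard per-word cross-entropy summed over the caption. A minimal PyTorch rendering, assuming the model returns per-step logits over the vocabulary, could look like this; the function name and the toy sizes are ours.

```python
import torch
import torch.nn.functional as F

def caption_nll(logits, target_ids):
    """Eq. (17): sum of -log P(S[t] | S[0..t-1], I) over the sentence.

    logits:     (N, vocab_size) unnormalized scores, one row per time step
    target_ids: (N,) ground-truth word indices S[0..N-1]
    """
    log_probs = F.log_softmax(logits, dim=1)                # log P over the vocabulary
    picked = log_probs.gather(1, target_ids.unsqueeze(1))   # log-prob of each true word
    return -picked.sum()                                    # negative log-likelihood

# Toy usage with an assumed vocabulary of 9568 words and a 10-word caption.
loss = caption_nll(torch.randn(10, 9568), torch.randint(0, 9568, (10,)))
```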
3.6. Implementation Details

In the following experiments, we use the 16-layer VGGNet [42] model to compute CNN features and map the last fully-connected layer's output features to an embedding space via a linear transformation.

As for the preprocessing of captions, we transform all letters in the captions to lowercase and remove all non-alphabetic characters. Words occurring less than five times are replaced with an unknown token <UNK>. We truncate all captions longer than 16 tokens and set the maximum number of input words for CNN_L to 16.
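A minimal sketch of this preprocessing step (lowercasing, stripping non-alphabetic characters, replacing rare words with <UNK>, and truncating to 16 tokens); the function and variable names are ours, not the authors'.

```python
import re
from collections import Counter

MAX_LEN, MIN_COUNT = 16, 5

def tokenize(caption):
    # Lowercase, drop non-alphabetic characters, truncate to 16 tokens.
    words = re.sub(r"[^a-z ]", " ", caption.lower()).split()
    return words[:MAX_LEN]

def build_vocab(captions):
    counts = Counter(w for c in captions for w in tokenize(c))
    # Words occurring fewer than five times are mapped to <UNK>.
    kept = [w for w, n in counts.items() if n >= MIN_COUNT]
    return {w: i for i, w in enumerate(["<START>", "<END>", "<UNK>"] + sorted(kept))}

def encode(caption, vocab):
    unk = vocab["<UNK>"]
    return [vocab.get(w, unk) for w in tokenize(caption)]

vocab = build_vocab(["A young girl skiing through a snow covered hill."] * 5)
print(encode("a girl skiing", vocab))
```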
3.6.1 Training Details

In the training process, each image I has five corresponding annotations. We first extract the image features V with CNN_I. The image features V are used at each time step. We map each word representation S^[t] with x^[t] = W_e S^[t], t ∈ [0, N-1]. After that, our network is trained to predict the words after it has seen the image and the preceding words. Please note that we denote by S^[0] a special <START> token and by S^[N-1] a special <END> token, which designate the start and the end of the sentence.

For Flickr30K [39] and MS COCO [29] we set the dimensionality of the image features and word embeddings to 512. All models are trained with Adam [20], a stochastic gradient descent method that computes an adaptive learning rate for each parameter. The learning rate is initialized to 2e-4 for Flickr30K and 4e-4 for MS COCO, and the restart technique of [33] is adopted to improve the convergence of training. Dropout and early stopping are used to avoid overfitting. All weights are randomly initialized except for the CNN weights. More specifically, we fine-tune the VGGNet when the validation loss stops decreasing. The termination of training is determined by evaluating the CIDEr [45] score on the validation split after each training epoch.

3.6.2 Testing

During testing, the previous output S^[t-1] is used as input in lieu of S^[t]. The sentence generation process is straightforward: our model starts from the <START> token and calculates the probability distribution of the next word, P(S^[t] | S^[0:t-1], I). Here we use the beam search technique described in [15], which is a fast and efficient decoding method for recurrent network models. We set a fixed beam size (k = 2) for all models (with RNNs) in our tests.
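As an illustration of beam search at test time, the following sketch keeps the k best partial captions at every step. It assumes a hypothetical step function log_prob_next(prefix) that returns log-probabilities over candidate next words given the image and the current prefix; it is a generic sketch, not the authors' decoder.

```python
import math

def beam_search(log_prob_next, start_id, end_id, beam_size=2, max_len=16):
    """Keep the `beam_size` highest-scoring prefixes at each step.

    log_prob_next(prefix) -> list of (word_id, log_prob) for the next word.
    """
    beams = [([start_id], 0.0)]                       # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word_id, lp in log_prob_next(prefix):
                candidates.append((prefix + [word_id], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[: beam_size * 2]:
            (finished if prefix[-1] == end_id else beams).append((prefix, score))
            if len(beams) == beam_size:
                break
        if not beams:                                  # every surviving prefix has ended
            break
    return max(finished + beams, key=lambda c: c[1])[0]

# Toy usage: a fake model that always prefers word 3, then the <END> token (id 1).
fake = lambda prefix: [(3, math.log(0.6)), (1, math.log(0.3)), (2, math.log(0.1))]
print(beam_search(fake, start_id=0, end_id=1))
```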
4. Experiments

4.1. Datasets and Evaluation Metrics

We perform experiments on two popular datasets for image caption generation: MS COCO and Flickr30k. These two datasets contain 123,000 and 31,000 images respectively, and each image has five reference captions. For MS COCO, we reserve 5,000 images for validation and 5,000 images for testing. For Flickr30k, we use 29,000 images for training, 1,000 images for validation, and 1,000 images for testing.

We choose four metrics for evaluating the quality of the generated sentences. BLEU-n [38] is a precision-based metric that measures how many words are shared by the generated captions and the ground-truth captions. METEOR [5] is based on explicit word-to-word matches between generated and ground-truth captions. CIDEr [45] is a metric developed specifically for evaluating image captions; it measures consensus in image captioning by performing a Term Frequency-Inverse Document Frequency weighting for each n-gram. SPICE [1] is a more recent metric which has been shown to correlate better with human judgments of semantic quality than previous metrics.

4.2. Models

To gain insight into the effectiveness of CNN_L, we compare CNN_L-based models with methods using the recurrent network only. For a fair comparison, the output dimensions of all gates are fixed to 512.

Recurrent Network-based Models. We implement the recurrent network-based models based on the framework proposed by Vinyals et al. [46]; it takes an image as input and predicts words with a one-layer recurrent network. Here we use the publicly available implementation Neuraltalk2 (https://github.com/karpathy/neuraltalk2). We evaluate four baseline models: Simple RNN, RHN, LSTM, and GRU.

CNN_L-based Models. As can be seen in Figure 1, the CNN_L-based models employ a CNN_L to obtain the bottom-up representation from the sequence of words and cooperate with the recurrent network to predict the next word. The image features and the word representation learned from CNN_I and CNN_L respectively are fused with the multimodal function. We implement four CNN_L-based models: CNN_L+Simple RNN, CNN_L+RHN, CNN_L+LSTM, and CNN_L+GRU.
4.3. Quantitative Results

We first evaluate the importance of the language CNN for image captioning, then evaluate the effects of CNN_L on the two datasets (Flickr30K and MS COCO), and finally compare with the state-of-the-art methods.

4.3.1 Analysis of CNN_L

It is known that CNN_L-based models have larger capacity than RNNs. To verify that the improved performance comes from the developed CNN_L rather than from more layers/parameters, we set the hidden and output sizes of the RNNs to 512 and 9568 (vocabulary size), and list the parameters of each model as well as their results in Table 1.

Table 1. Results on MS COCO, where B@n is short for BLEU-n and C is short for CIDEr. All values are reported as percentages (bold numbers in the original are the best results). CNN_L contains five temporal convolutional layers; the kernel size of the first two convolutional layers is 5 and that of the remaining convolutional layers is 3.

Approach    | Params | B@4  | C
Simple RNN  | 5.4M   | 27.0 | 87.0
CNN_L       | 6.3M   | 18.4 | 56.8
CNN_L+RNN   | 11.7M  | 29.5 | 95.2
LSTM        | 7.0M   | 29.2 | 92.6
LSTM_2      | 9.1M   | 29.7 | 93.2
LSTM_3      | 11.2M  | 29.3 | 92.9

As seen in Table 1, the parameter size of the 3-layer LSTM (LSTM_3) is close to that of CNN_L+RNN. Adding a second LSTM layer (LSTM_2) improves the performance of the LSTM, but it is still lower than that of CNN_L+RNN. Meanwhile, LSTM_3 does not show improvements, as the model experiences overfitting. This issue is even worse on Flickr30K, which has a relatively small amount of training data. Note that CNN_L (without RNNs) achieves lower performance than CNN_L+RNN. We find that the captions predicted by CNN_L alone are short but contain the primary attributes, e.g., the CNN_L model generates "a person on a wave", while CNN_L+RNN provides "a young man surfing a wave". This finding shows that the temporal recurrence of RNNs is still crucial for modeling the short-term contextual information across words in a sentence.

We further compare language CNNs with different numbers of input words and with max-pooling operations, where those language CNNs are combined with RHN instead of RNN. Table 2 shows that larger context windows achieve better performance. This is likely because a CNN_L with a larger window size can better utilize contextual information and learn a better word embedding representation. In addition, the performance of CNN_L*_{16 words}+RHN is inferior to CNN_L+RHN, which experimentally supports our view that max-pooling operations lose information about the local order of words.

Table 2. Results of different history information encoding approaches on MS COCO. CNN_L_{N words} takes N previous words as inputs, where N is set to 2, 4, and 8. Avg_history computes an average over history word embeddings. CNN_L*_{16 words} replaces the 2nd and 4th convolutional layers in CNN_L with max-pooling layers.

Approach                | B@4  | C
Avg_history+RHN         | 30.1 | 95.8
CNN_L*_{16 words}+RHN   | 28.9 | 91.9
CNN_L+RHN               | 30.6 | 98.9
CNN_L_{2 words}+RHN     | 29.2 | 93.8
CNN_L_{4 words}+RHN     | 29.5 | 95.8
CNN_L_{8 words}+RHN     | 30.0 | 95.9

4.3.2 Results Using CNN_L on MS COCO

Table 3 shows the generation performance on MS COCO. By combining CNN_L, our methods clearly outperform their recurrent network counterparts in all metrics.

Table 3. Performance comparison on MS COCO, where M is short for METEOR and S is short for SPICE.

Approach    | B@1  | B@2  | B@3  | B@4  | M    | C    | S
Simple RNN  | 70.1 | 52.1 | 37.6 | 27.0 | 23.2 | 87.0 | 16.0
CNN_L+RNN   | 72.2 | 55.0 | 40.7 | 29.5 | 24.5 | 95.2 | 17.6
RHN         | 70.5 | 52.7 | 37.8 | 27.0 | 24.0 | 90.6 | 17.2
CNN_L+RHN   | 72.3 | 55.3 | 41.3 | 30.6 | 25.2 | 98.9 | 18.3
LSTM        | 70.8 | 53.6 | 39.5 | 29.2 | 24.5 | 92.6 | 17.1
CNN_L+LSTM  | 72.1 | 54.6 | 40.9 | 30.4 | 25.1 | 99.1 | 18.0
GRU         | 71.6 | 54.1 | 39.7 | 28.9 | 24.3 | 93.3 | 17.2
CNN_L+GRU   | 72.6 | 55.4 | 41.1 | 30.3 | 24.6 | 96.1 | 17.6

Among these models, CNN_L+RHN achieves the best performance in terms of the B@(3,4), METEOR, and SPICE metrics, CNN_L+LSTM achieves the best performance on the CIDEr metric (99.1), and CNN_L+GRU achieves the best performance on the B@(1,2) metrics. Although the absolute gains across the different B@n metrics are similar, the relative performance improvement increases from B@1 to B@4, which shows the advantage of our method in better capturing long-term dependencies. Note that the CNN_L+RNN model achieves better performance than the simple RNN model and outperforms the LSTM model. As mentioned in Section 3.4, LSTM networks model word dependencies with multiple gates and an internal memory cell. However, our CNN_L+RNN, which has no memory cell, works better than the LSTM model. We think the reason is that our language CNN takes all history words as input and explicitly models the long-term dependencies in the history words; this can be regarded as an external "memory cell". Thus, the CNN_L's ability to model long-term dependencies can be taken as an enhancement of simple RNNs, which overcomes their difficulty in learning long-term dependencies.
4.3.3 Results Using CNN_L on Flickr30K

We also evaluate the effectiveness of the language CNN on the smaller Flickr30K dataset. The results in Table 4 clearly indicate the advantage of exploiting the language CNN to model the long-term dependencies among words for image captioning. Among all models, CNN_L+RHN achieves the best performance on the B@(1,2,3,4) metrics, and CNN_L+RNN achieves the best performance on the METEOR, CIDEr, and SPICE metrics.

Table 4. Performance comparison on Flickr30k.

Approach    | B@1  | B@2  | B@3  | B@4  | M    | C    | S
Simple RNN  | 60.5 | 41.3 | 28.0 | 19.1 | 17.1 | 32.5 | 10.5
CNN_L+RNN   | 71.3 | 53.8 | 39.6 | 28.7 | 22.6 | 65.4 | 15.6
RHN         | 62.1 | 43.1 | 29.4 | 20.0 | 17.7 | 38.4 | 11.4
CNN_L+RHN   | 73.8 | 56.3 | 41.9 | 30.7 | 21.6 | 61.8 | 15.0
LSTM        | 60.9 | 41.8 | 28.3 | 19.3 | 17.6 | 35.0 | 11.1
CNN_L+LSTM  | 64.5 | 45.8 | 32.2 | 22.4 | 19.0 | 45.0 | 12.5
GRU         | 61.4 | 42.5 | 29.1 | 20.0 | 18.1 | 39.5 | 11.4
CNN_L+GRU   | 71.4 | 54.0 | 39.5 | 28.2 | 21.1 | 57.9 | 14.5

As for the low results of the models without CNN_L on Flickr30k, we think this is due to the lack of enough training data to avoid overfitting. In contrast, our CNN_L can help learn better word embeddings and a better representation of the history words for word prediction, and it is much easier to train than an LSTM due to its simplicity and efficiency. Note that the performance of the LSTM and CNN_L+LSTM models is lower than that of RHN/GRU and CNN_L+RHN/GRU. This illustrates that LSTM networks easily overfit on this smaller dataset.

4.3.4 Comparison with State-of-the-art Methods

To empirically verify the merit of our models, we compare our methods with other state-of-the-art approaches.

Table 5. Performance in terms of BLEU-n, METEOR, and CIDEr compared with other state-of-the-art methods on the MS COCO and Flickr30k datasets. For the competing methods, we extract their performance from the latest versions of their papers.

                            Flickr30k                            MS COCO
Approach                    B-1   B-2   B-3   B-4   MET          B-1   B-2   B-3   B-4   MET   CIDEr
BRNN [19]                   57.3  36.9  24.0  15.7  —            62.5  45.0  32.1  23.0  19.5  66.0
Google NIC [46]             —     —     —     —     —            —     —     —     27.7  23.7  85.5
LRCN [6]                    58.8  39.1  25.1  16.5  —            66.9  48.9  34.9  24.9  —     —
MSR [7]                     —     —     —     —     —            —     —     —     25.7  23.6  —
m-RNN [35]                  60.0  41.0  28.0  19.0  —            67.0  49.0  35.0  25.0  —     —
Hard-Attention [51]         66.9  43.9  29.6  19.9  18.5         70.7  49.2  34.4  24.3  23.9  —
Soft-Attention [51]         66.7  43.4  28.8  19.1  18.5         71.8  50.4  35.7  25.0  23.0  —
ATT-FCN [53]                64.7  46.0  32.4  23.0  18.9         70.9  53.7  40.2  30.4  24.3  —
ERD+GoogLeNet [52]          —     —     —     —     —            —     —     —     29.8  24.0  88.6
emb-gLSTM [15]              64.6  44.6  30.5  20.6  17.9         67.0  49.1  35.8  26.4  22.7  81.3
VAE [40]                    72.0  53.0  38.0  25.0  —            72.0  52.0  37.0  28.0  24.0  90.0
State-of-the-art results using model assembling or extra information
Google NICv2 [47]           —     —     —     —     —            —     —     —     32.1  25.7  99.8
Attributes-CNN+RNN [50]     73.0  55.0  40.0  28.0  —            74.0  56.0  42.0  31.0  26.0  94.0
Our results
CNN_L+RNN                   71.3  53.8  39.6  28.7  22.6         72.2  55.0  40.7  29.5  24.5  95.2
CNN_L+RHN                   73.8  56.3  41.9  30.7  21.6         72.3  55.3  41.3  30.6  25.2  98.9
CNN_L+LSTM                  64.5  45.8  32.2  22.4  19.0         72.1  54.6  40.9  30.4  25.1  99.1
CNN_L+GRU                   71.4  54.0  39.5  28.2  21.1         72.6  55.4  41.1  30.3  24.6  96.1

Performance on MS COCO. The right-hand side of Table 5 shows the results of different models on the MS COCO dataset. The CNN_L-based models perform better than most image captioning models. The only two methods with better performance (on some metrics) than ours are Attributes-CNN+RNN [50] and Google NICv2 [47]. However, Wu et al. [50] employ an attribute prediction layer, which requires determining an extra attribute vocabulary, while we generate the image descriptions based only on the image features. Google NICv2 [47] is based on Google NIC [46], and its results are achieved by model ensembling. All our models are based on VGG-16 for a fair comparison with [6, 7, 15, 35, 50, 51]. Indeed, a better image CNN (e.g., ResNet [11]) leads to higher performance (we uploaded the results of Resnet-101+CNN_L+LSTM, named jxgu LCNN NTU, to the official MS COCO evaluation server at https://competitions.codalab.org/competitions/3221 and achieved competitive rankings across different metrics). Despite all this, the CIDEr score of our CNN_L+LSTM model still reaches 99.1, which is comparable to their best performance even with a single VGG-16 model.

Performance on Flickr30K. The results on Flickr30K are reported on the left-hand side of Table 5. Interestingly, CNN_L+RHN performs the best on this smaller dataset and even outperforms Attributes-CNN+RNN [50]. Obviously, there is a significant performance gap between the CNN_L+RNN/RHN/GRU models and the RNN/RHN/GRU/LSTM models. This demonstrates the effectiveness of our language CNN on the one hand, and on the other hand shows that our CNN_L+RNN/RHN/GRU models are more robust and easier to train than LSTM networks when less training data is available.
4.4. Qualitative Results

Figure 3. Qualitative results for images on MS COCO. Ground-truth annotations (under each dashed line) and the generated descriptions are shown for each image.

Figure 3 shows some examples generated by our models. It is easy to see that all of these caption generation models can generate somewhat relevant sentences, while the CNN_L-based models can predict more high-level words by jointly exploiting history words and image representations. Take the last image as an example: compared with the sentences generated by the RNN/LSTM/GRU models, "a cat is looking at a dog in front of a window", generated by CNN_L+RNN, describes the relationship in the image more precisely.

Besides, our CNN_L-based models can generate more descriptive sentences. For instance, with the detected object "cat" in the first image, the sentence "a black and white cat looking at itself in a mirror" generated by CNN_L+RHN depicts the image content more comprehensively. The results demonstrate that our model with the language CNN can generate more humanlike sentences by modeling the hierarchical structure and long-term information of words.

Figure 4. Some failure descriptions for images on MS COCO. Ground-truth descriptions are under each dashed line.

Figure 4 shows some failure cases of our CNN_L-based models. Although most of the generated captions are complete sentences, the biggest problem is that the predicted visual attributes are wrong. For example, the "bear" in the first image is detected as a "bird", and "brown" in the second image is detected as "black and white". This decreases the precision-based evaluation scores (e.g., B@n). We can improve our model by further taking high-level attributes into account.

5. Conclusion

In this work, we present an image captioning model with a language CNN to explore both hierarchical and temporal information in sequences for image caption generation. Experiments conducted on the MS COCO and Flickr30K image captioning datasets validate our proposal and analysis. Performance improvements are clearly observed when compared with other image captioning methods. Future research directions include integrating extra attribute learning into image captioning; applying a single language CNN for image caption generation is also worth trying.

Acknowledgements

This work is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its IDM Futures Funding Initiative, and an NTU CoE Grant. This research was carried out at the ROSE Lab at Nanyang Technological University, Singapore. The ROSE Lab is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its IDM Futures Funding Initiative, and administered by the Interactive and Digital Media Programme Office. We gratefully acknowledge the support of NVAITC (NVIDIA AI Tech Centre) for our research at the NTU ROSE Lab, Singapore.
References

[1] P. Anderson, B. Fernando, M. Johnson, and S. Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.
[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994.
[3] X. Chen and C. Lawrence Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, 2015.
[4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014.
[5] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In ACL, 2014.
[6] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[7] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
[8] C. Gan, T. Yang, and B. Gong. Learning attributes equals multi-source domain generalization. In CVPR, 2016.
[9] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014.
[10] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang. Recent advances in convolutional neural networks. arXiv preprint arXiv:1512.07108, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[12] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 2013.
[13] B. Hu, Z. Lu, H. Li, and Q. Chen. Convolutional neural network architectures for matching natural language sentences. In NIPS, 2014.
[14] P. Hu, B. Shuai, J. Liu, and G. Wang. Deep level sets for salient object detection. 2017.
[15] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars. Guiding long-short term memory for image caption generation. ICCV, 2015.
[16] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.
[17] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
[18] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. ACL, 2014.
[19] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[20] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[21] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Multimodal neural language models. In ICML, 2014.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[23] J. Kuen, Z. Wang, and G. Wang. Recurrent attentional networks for saliency detection. In CVPR, 2016.
[24] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
[25] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[26] R. Lebret, P. O. Pinheiro, and R. Collobert. Phrase-based image captioning. ICML, 2015.
[27] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-48. Springer, 2012.
[28] J. Li, M.-T. Luong, and D. Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. ACL, 2015.
[29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[30] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention correctness in neural image captioning. In AAAI, 2017.
[31] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal LSTM with trust gates for 3D human action recognition. In ECCV, 2016.
[32] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot. Global context-aware attention LSTM networks for 3D action recognition. CVPR, 2017.
[33] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with restarts. ICLR, 2016.
[34] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. CVPR, 2017.
[35] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). ICLR, 2014.
[36] J. B. Oliva, B. Poczos, and J. Schneider. The statistical recurrent unit. ICML, 2017.
[37] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[38] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
[39] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[40] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. NIPS, 2016.
[41] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. CVPR, 2017.
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2014.
[43] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.
[44] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[45] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
[46] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[47] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. PAMI, 2016.
[48] M. Wang, Z. Lu, H. Li, W. Jiang, and Q. Liu. genCNN: A convolutional architecture for word sequence prediction. ACL, 2015.
[49] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
[50] Q. Wu, C. Shen, L. Liu, A. Dick, and A. v. d. Hengel. What value do explicit high level concepts have in vision to language problems? CVPR, 2016.
[51] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
[52] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen. Review networks for caption generation. NIPS, 2016.
[53] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. CVPR, 2016.
[54] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.

