

(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 11, No. 12, 2020

Comparative Evaluation of CNN Architectures for Image Caption Generation

Sulabh Katiyar1, Samir Kumar Borgohain2


Department of Computer Science and Engineering
National Institute of Technology, Silchar
Assam, India 788010

Abstract—Aided by recent advances in Deep Learning, Image Caption Generation has seen tremendous progress over the last few years. Most methods use transfer learning to extract visual information, in the form of image features, with the help of pre-trained Convolutional Neural Network models, followed by transformation of the visual information using a Caption Generator module to generate the output sentences. Different methods have used different Convolutional Neural Network architectures and, to the best of our knowledge, there is no systematic study which compares the relative efficacy of different Convolutional Neural Network architectures for extracting the visual information. In this work, we have evaluated 17 different Convolutional Neural Networks on two popular Image Caption Generation frameworks: the first based on the Neural Image Caption (NIC) generation model and the second based on the Soft-Attention framework. We observe that the model complexity of a Convolutional Neural Network, as measured by the number of parameters, and the accuracy of the model on the Object Recognition task do not necessarily correlate with its efficacy in feature extraction for the Image Caption Generation task. We release the code at https://github.com/iamsulabh/cnn variants.

Keywords—Convolutional Neural Network (CNN); image caption generation; feature extraction; comparison of different CNNs

I. INTRODUCTION

Image Caption Generation involves training a Machine Learning model to learn to automatically produce a single sentence description for an image. For human beings it is a trivial task. However, for a Machine Learning method to be able to perform this task, it has to learn to extract all the relevant information contained in the image and then to convert this visual information into a suitable representation of the image which can be used to generate a natural language sentence description of the image. The visual features extracted from the image should contain information about all the relevant objects present in the image, the relationships among the objects and the activity settings of the scene. Then the information needs to be suitably encoded, generally in a vectorized form, so that the sentence generator module can convert it into a human readable sentence. Furthermore, some information may be implicit in the scene, such as a scene where a group of football players are running in a football field but the football is not present in the scene frame. Thus the model may need to learn some level of knowledge about the world as well. The ability to automate the caption generation process has many benefits for society, as it can either replace or complement any method that seeks to extract some information from images, and it has applications in the fields of education, military, medicine, etc., as well as in some specific problems such as helping visually impaired people in navigation or generating news information from images.

During the last few years there has been tremendous progress in Image Caption Generation due to advances in the Computer Vision and Natural Language Processing domains. The progress made in the Object Recognition task due to the availability of large annotated datasets such as ImageNet [1] has led to the availability of pre-trained Convolutional Neural Network (CNN) models which can extract useful information from the image in vectorized form, which can then be used by a caption generation module (called the decoder) to generate caption sentences. Similarly, progress in solving machine translation with methods such as the encoder-decoder framework proposed in [2], [3] has led to the adoption of a similar format for Image Caption Generation, where the source sentence in the machine translation task is replaced by the image in the caption generation task and the process is then approached as 'translation' of image to sentence, as has been done in works such as [4], [5], [6]. The attention based framework proposed by [7], where the decoder learns to focus on certain parts of the source sentence at certain time-steps, has been adapted to caption generation in such a way that the decoder focuses on portions of the image at certain time-steps [8]. A detailed survey of Image Caption Generation has been provided in [9] and [10].

Although there has been a lot of focus on the decoder, which 'interprets' the image features and 'translates' them into a caption, there has not been enough focus on the encoder, which 'encodes' the source image into a suitable visual representation (called image features). This is mainly because most methods use transfer learning to extract image features from pre-trained Convolutional Neural Networks (CNN) [11] which are trained on the Object Detection task of the ImageNet Large Scale Visual Recognition Challenge [12], where the goal is to predict the object category out of the 1000 categories annotated in the dataset. Since the last layer of the CNN produces a 1000-length vector containing the relative probabilities of all object categories, the last layer is dropped and the output(s) of intermediate layer(s) is (are) used as image features. Numerous CNN architectures have been proposed with varying complexity and efficacy, and many have been utilized for Image Caption Generation as well. However, this makes it difficult to undertake a fair comparison of Image Caption Generation methods, since the difference in performance could be either due to differences in the effectiveness of the decoders in sentence generation or due to differences in the effectiveness of the encoders in feature extraction.

Hence, in this work we evaluate Image Caption Generation using popular CNN architectures which have been used for the Object Recognition task and analyse the correlation between model complexity, as measured by the total number of parameters, and the effectiveness of different CNN architectures in feature extraction for Image Caption Generation. We use two popular Image Caption Generation frameworks: (a) the Neural Image Caption (NIC) Generator proposed in [6] and (b) the Soft Attention based Image Caption Generation proposed in [8]. We observe that the performance of Image Caption Generation varies with the use of different CNN architectures and is not directly correlated with either the model complexity or the performance of the CNN on the object recognition task. To further validate our findings, we evaluate multiple versions of the ResNet CNN [13] with different depths (number of layers in the CNN) and complexity: ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152, where the numerical part of the name stands for the number of layers in the CNN (such as 18 layers in ResNet18, and so on). We also evaluate multiple versions of the VGG CNN [14] architecture: VGG-11, VGG-13, VGG-16 and VGG-19, and multiple versions of the DenseNet CNN [15] architecture: DenseNet121, DenseNet169, DenseNet201 and DenseNet161, each of which has a different number of parameters. We observe that performance does not improve with the increase in the number of layers and, consequently, the increase in model complexity. This further validates our observation that the effectiveness of a CNN architecture for Image Caption Generation depends on the model design, and that the model complexity or the performance on the Object Detection task are not good indicators of the effectiveness of a CNN for Image Caption Generation. To the best of our knowledge, this is the first such detailed analysis of the role of CNN architectures as image feature extractors for the Image Caption Generation task. In addition, to further the future research work in this area, we also make the implementation code available for reference (https://github.com/iamsulabh/cnn variants).

This paper is divided into the following sections: in Section II, we discuss the relevant methods proposed in the literature; in Section III, we discuss the methodology of our work; in Section IV, we present and discuss the experimental results; and in Section V, we discuss the implications of our work and possible future studies.

II. RELATED WORK

Some of the earliest works attempted to solve the problem of caption generation in constrained environments, such as the work proposed in [16] where the authors try to generate captions for objects present in an office setting. Such methods had limited scalability and applications. Some works tried to address the task as a Retrieval problem, where a pool of sentences was constructed which could describe all (or most) images in a particular setting. Then, for a target image, a sentence which was deemed appropriate by the algorithm was selected as the caption. For example, in [17], the authors construct a 'meaning space' which consists of triplets of <objects, actions, scene>. This is used as a common mapping space for images and sentences. A similarity measure is used to find sentences with the highest similarity to the target image, and the most similar sentence is selected as the caption. In [18], a set of images which are similar to the target image is retrieved from the training data using a visual similarity measure. Then a word probability density conditioned on the target image is calculated using the captions of the images that were retrieved in the last step. The captions in the dataset are then scored using this word probability density, and the sentence which has the highest score is selected as the caption for the target image. The retrieval based methods generally produce grammatically correct and fluent captions because they select a human generated sentence for a target image. However, this approach is not scalable because a large number of sentences need to be included in the pool for each kind of environment. Also, the selected sentence may not even be relevant, because the same kind of objects may have different kinds of relationships among them which cannot be described by a fixed set of sentences.

Another class of approaches are the Template based methods, which construct a set of hand-coded sentence templates according to the rules of grammar and semantics and optimization algorithms. The methods then plug different object components and their relationships into the templates to generate sentences for the target image. For example, in [19], Conditional Random Fields are used to recognize image contents. A graph is constructed with the image objects, their relationships and attributes as nodes of the graph. The reference captions available with the training images are used to calculate pairwise relationship functions using statistical inference, and the visual concepts are used to determine the unary operators on the nodes. In [20], visual models are used to extract information about objects, attributes and spatial relationships. The visual information is encoded in the form of [<adjective1, object1>, preposition, <adjective2, object2>] triplets. Then n-gram frequency counts are extracted from a web-scale training dataset using statistical inference, and dynamic programming is used to determine the optimal combination of phrases to perform phrase fusion and construct the sentences. Although the Template based approaches are able to generate more varied captions, they are still handicapped by problems of scalability, because a large number of sentence templates have to be hand-coded and even then a lot of phrase combinations may be left out.

In recent years, most of the works proposed in the literature have employed Deep Learning to generate captions. Most works use CNNs, which are pre-trained on the ImageNet Object Recognition dataset [1], to extract a vectorized representation of the image. Words of a sentence are represented as Word Embedding vectors extracted from a look-up table. The look-up table is learned during training as the set of weights of the Embedding Layer. The image and word information is combined in different ways. Most methods use different variants of the Recurrent Neural Network (RNN) [21] to model the temporal relationships between words in the sentence. In [5], the image features extracted from the CNN and the word embeddings are mapped to the same vector space and merged using element-wise addition at each time-step. The merged image features and word embeddings are then used as input to a MultiModal Recurrent Neural Network (m-RNN) which generates the output. The authors use AlexNet [22] and VGG-16 [14] as CNNs to extract image features. In [4], a Bidirectional Recurrent Neural Network is used as the decoder because it can map the word relationships with both the words that precede and the words that succeed a particular word in

the sentence. The word embeddings and image features are merged before being fed into the decoder. The authors use the AlexNet [22] CNN to extract image features. In [6], a Long Short Term Memory (LSTM) network [23] is used as the decoder. The image features are mapped to the vector space spanned by the hidden state representations of the LSTM and are used as the initial hidden state of the LSTM. Thus the image information is fed to the LSTM at the initial state only. The LSTM takes in previously generated words as input (with a special 'start' token as the first input) and generates the next word sequentially. The authors use [24] as the CNN for extracting image features. Using the attention approach, in [8] the authors train the model to focus on certain parts of the image at certain time-steps. This attention mechanism takes as input the image features and the output up to the last time-step, and generates an image representation conditioned on the text input. This is merged with the word embeddings at the current time-step using a vector concatenation operation and used as input to the LSTM generator. The authors used the VGGNet [14] CNN as the image feature extractor. Recently, methods using Convolutional Neural Networks as sequence generators have been proposed, such as in [25] for text generation. Based on this approach, [26] propose a method which uses one CNN for encoding the image and another CNN for decoding the image. The CNN decoder is similar to the one used in [25] and uses a hierarchy of layers to model word relationships of increasing complexity. The authors use the ResNet152 [13] CNN to encode the image features. More recently, the Transformer network, which uses self-attention to model word relationships instead of recurrent or convolutional operations, has been used [27]. Based on this approach, a Transformer based caption generation method is proposed in [28].

Since most of the methods use different CNN architectures to extract image features, there is a need for a comparative analysis of their effectiveness in image feature extraction using the same overall format for caption generation.

III. PROPOSED METHOD

In image caption generation, given an image the task is to generate a set of words $S = \{w_1, w_2, w_3, \ldots, w_L\}$, where $w_i \in V$, $L$ is the length of the sentence and $V$ represents the vocabulary of the dataset. The words $w_1$ and $w_L$ are usually the special tokens for the start and end of the sentence. Two more special tokens, for 'unknown' and 'padding', are also used: one for representing unknown words (which may be the stop words and rare words that have been removed from the dataset to speed up training) and one for padding the end of the sentence (to make all sentences of equal length, because RNNs do not handle sentences of different lengths in the same batch). Given pairs of image and sentence, $(I_N, S_i)$ for $i \in (1, 2, 3, \ldots, j)$, during training we maximize the probability $P(S_i|I_N, \theta)$, where $j$ is the number of captions for an image in the training set and $\theta$ represents the set of parameters of the model. Hence, as mentioned in [6], during training the model learns to update the set of parameters $\theta$ such that the probability of generating correct captions is maximized according to the equation

$\theta^{\star} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I, \theta)$   (1)

where $\theta$ is the set of all parameters of the model, $I$ is the image and $S$ is one of the reference captions provided with the image. We can use the chain rule because the generation of the words of a sentence depends on previously generated words, and hence Equation 1 can be extended to the constituent words of the sentence as

$\log p(S \mid I, \theta) = \sum_{t=0}^{L} \log p(w_t \mid I, \theta, w_1, w_2, \ldots, w_{t-1})$   (2)

where $w_1, w_2, \ldots, w_L$ are the words in the sentence $S$ of length $L$. This equation can be modelled using a Recurrent Neural Network which generates the next output conditioned on the previous words of the sentence. We have used the LSTM as the RNN variant for our experiments.
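In practice, maximizing Equation 1 under the decomposition of Equation 2 is equivalent to minimizing the per-word cross-entropy loss over the reference caption. The following minimal sketch illustrates this equivalence, assuming a PyTorch setup; the tensors are random placeholders rather than real model outputs.

```python
import torch
import torch.nn.functional as F

# Placeholder decoder outputs: one score per vocabulary word at each time-step.
batch, length, vocab = 1, 12, 5000
logits = torch.randn(batch, length, vocab)            # unnormalized word scores
targets = torch.randint(0, vocab, (batch, length))    # reference caption word indices

# Eq. (2): sum of log-probabilities of the reference words.
log_probs = F.log_softmax(logits, dim=-1)
sentence_log_prob = log_probs.gather(-1, targets.unsqueeze(-1)).sum()

# Maximizing Eq. (1) is the same as minimizing the summed cross-entropy.
loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1), reduction="sum")
assert torch.allclose(loss, -sentence_log_prob, atol=1e-4)
```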
In this work, we evaluate caption generation performance on two popular encoder-decoder frameworks with certain modifications. For both methods, we experiment with different CNN architectures for image feature extraction and analyse the effects on performance.

The first method is based on the Neural Image Caption generation method proposed in [6]. However, unlike the method proposed in [6], we have not used model ensembles to improve performance. In addition, we have extracted image features from a lower layer of the CNN, which produces a set of vectors, each of which contains information about a region of the image. We have observed that this leads to better performance, as the decoder is able to use region specific information to generate captions. Throughout this paper, this will be referred to as the 'CNN+LSTM' approach, with the word 'CNN' replaced by the name of the CNN architecture used in the experiment. For example, 'ResNet18+LSTM' refers to caption generation with ResNet18 as the CNN.

The second method is similar to the Soft Attention method proposed in [8]. We use an attention mechanism which learns to focus on certain portions of the image at certain time-steps when generating the captions. Similar to the CNN+LSTM approach, this Soft Attention approach will be referred to as the 'CNN+LSTM+Attention' approach, with the word 'CNN' replaced by the name of the CNN architecture used. Figure 1 explains both methods.

A. Image Feature Extraction

For extracting image features, we use CNNs which were pre-trained on the ImageNet dataset [1] for the ImageNet Large Scale Visual Recognition Challenge [12]. The models generate a single output vector containing the relative probabilities of the different object categories (with 1000 categories in total). We remove this last layer from the CNN since we need more fine-grained information. We also remove all the layers at the top (with the input layer being called the bottom layer) which produce a single vector as output, because we need a set of vectors as output which contain information about different regions of the image. Hence, the image features are a set of vectors denoted as $a = \{a_1, a_2, a_3, \ldots, a_{|a|}\}$, $a_i \in \mathbb{R}^D$, where $|a|$ is the number of feature vectors contained in $a$, $\mathbb{R}$ represents the real numbers and $D$ is the dimension of each vector. For example, the ResNet152 CNN [13] generates a set of 8 vectors, each of 2048 dimensions.
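As an illustration of this feature extraction step, the sketch below strips the classification head from a pre-trained torchvision ResNet and flattens the remaining spatial feature map into a set of region vectors. This is a minimal sketch assuming a PyTorch/torchvision pipeline; the exact number of vectors depends on the input resolution and any pooling used, and may differ from the configuration in our released code.

```python
import torch
import torchvision.models as models

# Pre-trained on ImageNet; drop the average-pooling and fully connected layers
# so that the network outputs a spatial grid of feature vectors rather than
# 1000 class probabilities.
cnn = models.resnet152(pretrained=True)
backbone = torch.nn.Sequential(*list(cnn.children())[:-2])
backbone.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)      # a preprocessed image batch (placeholder)
    fmap = backbone(image)                   # (1, 2048, 7, 7) for a 224x224 input
    # Flatten the spatial grid into a set of |a| feature vectors a_i of dimension D.
    a = fmap.flatten(2).permute(0, 2, 1)     # (1, |a| = 49, D = 2048)
```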

Fig. 1. An Overview of the Two Approaches Proposed in this Work: (a) Encoder-Decoder based Approach. (b) Attention based Approach with an Attention Mechanism to Focus on Salient Portions of the Image. (AL stands for Attention Layer.)

The set of image feature vectors thus generated is used in two ways in this work. In the 'CNN+LSTM' method, the image features are mapped to the vector space of the hidden state of the LSTM and used to initialize the hidden and cell states of the LSTM decoder. For the 'CNN+LSTM+Attention' method, in addition to the hidden and cell state initialization, the set of image feature vectors is also used at each time-step to calculate attention weighted image features, which contain information from those regions of the image which are important at the current time-step. We explain this in detail in Sections III-B and III-C.

B. CNN + LSTM Method

In this method, we use a CNN encoder to extract image information and use that information as the initial hidden state of the LSTM decoder. Using the set of image feature vectors obtained as described in Section III-A, we obtain a single vector by averaging the values of all vectors in the set as

$a_{ave} = \frac{1}{|a|} \sum_{i=1}^{|a|} a_i$   (3)

where $|a|$ is the number of image feature vectors extracted from the CNN. This vector is used to generate the initial hidden and cell states of the LSTM by an affine transformation followed by a non-linearity (the $\tanh$ function) as

$h_0 = \tanh(a_{ave} W^h + b^h)$   (4)

$c_0 = \tanh(a_{ave} W^c + b^c)$   (5)

where $W^h$, $W^c$ and $b^h$, $b^c$ are the weights and biases of the MultiLayer Perceptron (MLP) which is used to model the transformations.
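A minimal sketch of Equations 3-5 in PyTorch follows; the layer names are illustrative and not taken from our implementation.

```python
import torch
import torch.nn as nn

class DecoderInit(nn.Module):
    """Maps the set of image feature vectors to the initial LSTM states (Eqs. 3-5)."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # W^h, b^h
        self.init_c = nn.Linear(feat_dim, hidden_dim)   # W^c, b^c

    def forward(self, a):                     # a: (batch, |a|, D)
        a_ave = a.mean(dim=1)                 # Eq. (3): average of the |a| vectors
        h0 = torch.tanh(self.init_h(a_ave))   # Eq. (4)
        c0 = torch.tanh(self.init_c(a_ave))   # Eq. (5)
        return h0, c0
```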
The successive hidden and cell states are generated during training. Since the generation of words is dependent on the previous words in the sentence, as depicted in Equation 2, this dependence can be modelled using the hidden state of the LSTM (which is also modulated by the cell state). Hence,

$P_\theta(w_i \mid I, w_1, w_2, \ldots, w_{i-1}) = P_\theta(w_i \mid I, h_i) = f_\theta(w_i, I, h_i)$   (6)

where $f_\theta$ is any differentiable function, and since it is recursive in nature it can be modelled using an RNN. Since the hidden state also depends on the previous hidden states, it can be modelled as a function of the previous hidden state and inputs as

$h_i = f_\theta(w_{i-1}, h_{i-1}, I)$   (7)

where $f_\theta$ is the same differentiable function as in Equation 6, since the model is trained end-to-end with the same parameters. Words are represented as word embeddings, produced by a function that maps one-hot word vectors to the embedding dimensions and is also learned with the rest of the model, as

$w_i^e = f_\theta(w_i)$   (8)

where $f_\theta$ is the same differentiable function as in Equation 6 and $w_i^e$ is the word embedding vector for the word $w_i$.

We use the LSTM as described in [23]. The LSTM has three control gates: the input, forget and output gates. The equations for updating the different gates are as follows:

$i_t = \sigma(W_i x_t + R_i h_{t-1} + b_i)$   (9)

$f_t = \sigma(W_f x_t + R_f h_{t-1} + b_f)$   (10)

$o_t = \sigma(W_o x_t + R_o h_{t-1} + b_o)$   (11)

$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_z x_t + R_z h_{t-1} + b_z)$   (12)

$h_t = o_t \odot \tanh(c_t)$   (13)

where $W_i$ and $R_i$, $W_f$ and $R_f$, $W_o$ and $R_o$, and $W_z$ and $R_z$ are the weight matrix pairs (input and recurrent weight matrices) for the input, forget, output and input modulator ($\tanh$) gates, respectively, and $b$ denotes the corresponding bias vector. $\sigma$ is the sigmoid function, expressed as $\sigma(x) = 1/(1 + \exp(-x))$, which condenses its input to the range (0, 1). $\tanh$ is the hyperbolic tangent function, which condenses its input to the range (-1, 1). $i_t$, $o_t$ and $f_t$ are the input, output and forget gates, respectively. The input gate processes the input information. The output gate generates output based on the input, and some of this information has to be dropped, which is decided by the cell state. The cell state stores information about the context. The forget gate decides what contextual information has to be dropped from the cell state. The internal structure of the LSTM is depicted in Fig. 2.
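The word-by-word generation described by Equations 6-13 can be sketched with PyTorch's built-in LSTM cell as follows. This is an illustration with hypothetical sizes, a placeholder start token and greedy decoding, not our exact implementation.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, max_len, start_token = 5000, 256, 512, 20, 1

embedding = nn.Embedding(vocab_size, embed_dim)   # Eq. (8): learned word embeddings
lstm_cell = nn.LSTMCell(embed_dim, hidden_dim)    # Eqs. (9)-(13)
to_vocab = nn.Linear(hidden_dim, vocab_size)      # maps h_t to word scores

# The initial states would come from the averaged image features (Eqs. 4-5);
# zeros are used here only to keep the sketch self-contained.
h = torch.zeros(1, hidden_dim)
c = torch.zeros(1, hidden_dim)
word = torch.tensor([start_token])

caption = []
for t in range(max_len):
    x = embedding(word)                 # embedding of the previously generated word
    h, c = lstm_cell(x, (h, c))         # Eq. (7): h_t depends on w_{t-1} and h_{t-1}
    word = to_vocab(h).argmax(dim=-1)   # greedy choice of the next word (Eq. 6)
    caption.append(word.item())
```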

Fig. 2. Illustration of a Basic LSTM Cell.

C. CNN + LSTM + Attention Method

In this method, in addition to the initial time-step, the image information is fed into the LSTM at each time-step. However, a separate attention mechanism generates information which is extracted from only those regions of the image which are relevant at the current time-step.

The attention mechanism produces a context vector which represents the relevant portion of the image at each time-step. First, a score is calculated for each image feature vector $a_i \in a$, $i \in (1, 2, 3, \ldots, |a|)$, as described in Section III-A,

$P = \{p_{ti}\}, \quad p_{ti} = f_{att}(a_i, h_{t-1})$   (14)

where $i \in (1, 2, 3, \ldots, |a|)$. The attention weights are then calculated as

$\alpha = \{\alpha_{ti}\}, \quad \alpha_{ti} = \frac{\exp(p_{ti})}{\sum_{k=1}^{|a|} \exp(p_{tk})}$   (15)

where $\alpha$ is the set of weights, one for each image feature vector $a_i$ in $a$, such that $\sum_{i=1}^{|a|} \alpha_{ti} = 1$. The context vector is then calculated by another function,

$z_t = \Phi(\{a_i\}, \{\alpha_{ti}\})$   (16)

We have used the functions $f_{att}$ and $\Phi$ as described in [8].

With the context vector thus obtained, the equations for the gates of the LSTM decoder become

$i_t = \sigma(W_i x_t + R_i h_{t-1} + Z_i z_t + b_i)$   (17)

$f_t = \sigma(W_f x_t + R_f h_{t-1} + Z_f z_t + b_f)$   (18)

$o_t = \sigma(W_o x_t + R_o h_{t-1} + Z_o z_t + b_o)$   (19)

$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + R_c h_{t-1} + Z_c z_t + b_c)$   (20)

$h_t = o_t \odot \tanh(c_t)$   (21)

where $W_i$ and $R_i$, $W_f$ and $R_f$, $W_o$ and $R_o$, and $W_c$ and $R_c$ are the weight matrix pairs (input and recurrent weight matrices) for the input, forget, output and input modulator ($\tanh$) gates, respectively, $Z_i$, $Z_f$, $Z_o$ and $Z_c$ are the corresponding weight matrices for the context vector, $b$ is the bias vector and $\sigma$ is the sigmoid function.
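A minimal additive soft-attention sketch corresponding to Equations 14-16, written in PyTorch with illustrative layer names, is shown below; the exact forms of $f_{att}$ and $\Phi$ follow [8] only loosely here.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Computes attention weights over the |a| image feature vectors (Eqs. 14-16)."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, a, h_prev):            # a: (B, |a|, D), h_prev: (B, hidden_dim)
        # Eq. (14): a score p_ti for every feature vector, conditioned on h_{t-1}.
        p = self.score(torch.tanh(self.feat_proj(a)
                                  + self.hidden_proj(h_prev).unsqueeze(1)))
        alpha = torch.softmax(p.squeeze(-1), dim=1)      # Eq. (15): attention weights
        z = (alpha.unsqueeze(-1) * a).sum(dim=1)         # Eq. (16): context vector z_t
        return z, alpha
```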
IV. EXPERIMENTS AND RESULTS

In this section we describe the experimental details and the results. We have evaluated the Squeezenet [31], Shufflenet [32], Mobilenet [33], MnasNet [34], ResNet [13], GoogLeNet [29], DenseNet [15], Inceptionv4 [24], AlexNet [22], DPN (Dual Path Network) [36], ResNext [37], SeNet [39], PolyNet [40], WideResNet [38], VGG [14], NASNet-Large [35] and InceptionResNetv2 [41] CNN models. Among these, we have evaluated five versions of ResNet, viz. ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152, four versions of DenseNet, viz. DenseNet121, DenseNet169, DenseNet201 and DenseNet161, and four versions of VGG, viz. VGG-11, VGG-13, VGG-16 and VGG-19, which are similar in architecture but differ widely in terms of number of parameters and also in terms of accuracy and error rates on the Object Recognition task with the ImageNet dataset.

We have evaluated the performance using the BLEU, METEOR, CIDEr, ROUGE-L and SPICE metrics that were recommended in the MSCOCO Image Caption Evaluation task [42]. The evaluation results are provided in Tables I and II for the 'CNN+LSTM' and 'CNN+LSTM+Attention' methods, respectively. In addition, we have provided some examples of generated captions in Tables III and IV for the two methods.

We have used the Flickr8k [30] dataset, which contains around 8000 images with five reference captions each. Out of the 8000 images, around 1000 are earmarked for the validation set, around 1000 are meant for the test set and the remaining are used for the training set.

We can make the following observations from the results:

• There is a variation of around 4 to 5 points in the evaluation metrics between the best and worst performing models in both Tables I and II.

• The performance of a decoder framework which employs additional methods of guidance (such as attention) but uses a lower performing encoder can be worse than that of simpler methods which use a better performing CNN encoder. For example, the best performing models using the CNN+LSTM method (Table I) perform better than the lower performing models using the CNN+LSTM+Attention method (Table II).

• Although different variants of the same model (such as ResNet, DenseNet and VGG) differ greatly with respect to the number of parameters, they produce image captioning performances which differ only by around 1 point on most evaluation metrics. ResNet18, being the smallest model in terms of number of parameters among the ResNet based CNNs, performs competitively as compared to the larger ResNet variants which have many times more parameters. We also observe that DenseNet121 and VGG-11, being the smallest models among the DenseNet and VGG models, respectively, outperform the other DenseNet and VGG based CNNs along certain metrics.
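The scores in Tables I and II were obtained with the MSCOCO evaluation protocol [42]. As a rough, unofficial illustration of how one of these metrics is computed, corpus-level BLEU can be estimated with NLTK as follows; tokenization and smoothing differ from the official toolkit, so the numbers will not match the tables exactly.

```python
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is paired with its (multiple) reference captions, all tokenized.
references = [
    [["a", "dog", "runs", "through", "the", "snow"],
     ["a", "brown", "dog", "is", "running", "in", "the", "snow"]],
]
hypotheses = [["a", "dog", "is", "running", "through", "the", "snow"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(bleu1, bleu4)
```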

TABLE I. PERFORMANCE OF THE CNN+LSTM METHOD USING DIFFERENT CNN ARCHITECTURES.

CNN name | Parameters (in thousands) | Top-5 O.D. error | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr | ROUGE-L | SPICE
Squeezenet [31] 1,248 19.58 60.04 40.65 26.95 17.61 18.12 42.87 44.05 12.44
Shufflenet[32] 2,279 11.68 59.70 41.18 27.84 18.67 18.24 44.36 43.66 12.61
Mobilenet[33] 3,505 9.71 60.60 41.72 28.44 18.87 18.83 47.97 44.28 13.50
MnasNet[34] 4,383 8.456 61.19 43.02 29.43 20.10 18.94 48.19 44.88 13.46
Densenet121 [15] 7,979 7.83 61.62 43.36 29.47 19.88 19.39 48.99 45.32 13.64
ResNet18 [13] 11,689 10.92 62.21 43.45 29.84 20.30 18.91 48.31 45.33 13.49
GoogLeNet [29] 13,005 10.47 60.69 41.57 28.20 18.91 18.66 46.42 44.38 13.01
Densenet169 [15] 14,150 7.00 63.73 45.00 30.87 21.13 19.95 52.88 46.41 14.32
DenseNet201 [15] 19,447 6.43 63.29 45.11 31.36 21.63 19.80 52.21 46.40 14.16
Resnet34 [13] 21,798 8.58 61.08 42.69 29.32 19.98 18.98 49.78 45.01 13.32
Resnet50 [13] 25,557 7.13 61.86 43.79 30.10 20.27 19.11 50.86 45.76 13.89
Densenet161 [15] 28,681 6.20 63.12 44.68 30.76 20.79 20.00 54.24 46.19 14.26
Inceptionv4 [24] 42,680 4.80 59.49 40.47 27.00 18.03 18.22 43.17 43.61 12.23
Resnet101 [13] 44,549 6.44 62.77 44.11 30.62 21.10 19.65 53.00 45.91 14.04
InceptionResNetv2 [41] 54,340 4.9 59.50 40.55 27.36 18.21 18.79 46.35 43.54 12.90
ResNet152 [13] 60,193 5.94 62.30 44.24 30.84 21.21 19.50 55.10 46.14 14.20
AlexNet [22] 61,101 20.91 59.24 40.17 26.82 17.87 17.51 41.09 42.79 11.78
DPN131 [36] 75,360 5.29 59.60 40.69 27.58 18.86 18.00 42.36 43.15 12.67
ResNext101 [37] 88,791 5.47 62.38 43.79 29.85 20.20 19.54 51.37 45.54 14.05
NASNetLarge [35] 88,950 3.8 56.08 36.76 23.54 15.46 16.76 34.74 40.50 11.56
SeNet154 [39] 115,089 4.47 61.67 43.18 29.72 20.19 19.48 49.89 45.24 13.95
PolyNet [40] 118,733 4.25 60.26 41.26 27.68 18.68 18.02 44.23 43.61 12.37
WideResNet101 [38] 126,886 5.72 61.42 42.48 28.71 19.16 18.64 46.24 44.41 13.23
VGG-11(bn) [14] 132,869 11.37 61.70 43.37 30.08 20.86 19.38 48.98 45.80 13.62
VGG-13(bn) [14] 133,054 10.75 60.79 42.42 28.91 19.70 19.06 46.57 44.84 13.39
VGG-16(bn) [14] 138,366 8.50 60.56 41.98 28.66 19.51 19.04 48.41 44.82 13.71
VGG-19(bn) [14] 143,678 9.12 61.40 43.09 29.49 20.02 19.15 49.42 45.43 13.61

TABLE II. PERFORMANCE OF THE CNN+LSTM+ATTENTION METHOD USING DIFFERENT CNN ARCHITECTURES.

CNN name | Parameters (in thousands) | Top-5 O.D. error | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr | ROUGE-L | SPICE
Squeezenet [31] 1,248 19.58 60.79 42.29 28.78 19.41 18.80 46.54 44.48 12.85
Shufflenet[32] 2,279 11.68 62.36 43.87 30.42 21.00 19.18 49.01 45.00 13.50
Mobilenet[33] 3,505 9.71 63.69 45.33 31.72 21.89 19.63 55.36 46.28 14.25
MnasNet[34] 4,383 8.456 63.99 45.75 32.11 22.36 19.78 54.84 46.17 14.02
Densenet121 [15] 7,979 7.83 64.11 45.67 31.76 22.07 20.43 55.85 46.74 14.91
ResNet18 [13] 11,689 10.92 63.26 44.87 31.07 21.24 20.08 52.44 45.84 13.75
GoogLeNet [29] 13,005 10.47 62.91 44.27 30.27 20.50 19.51 50.72 46.02 13.80
Densenet169 [15] 14,150 7.00 64.48 46.17 32.28 22.30 20.81 56.25 46.82 14.93
DenseNet201 [15] 19,447 6.43 64.38 46.26 32.41 22.49 20.73 59.71 47.19 15.13
Resnet34 [13] 21,798 8.58 63.36 45.28 31.88 22.23 19.88 55.35 46.17 14.40
Resnet50 [13] 25,557 7.13 65.32 46.92 32.81 22.58 20.87 57.12 46.95 14.90
Densenet161 [15] 28,681 6.20 65.00 46.99 32.83 22.56 20.44 56.74 47.57 14.93
Inceptionv4 [24] 42,680 4.80 60.17 42.24 28.71 19.35 18.76 48.00 44.33 13.26
Resnet101 [13] 44,549 6.44 64.33 45.99 32.13 22.02 20.29 56.09 46.58 14.80
InceptionResNetv2 [41] 54,340 4.9 61.46 42.98 29.20 19.84 19.20 49.83 44.44 13.81
ResNet152 [13] 60,193 5.94 65.26 47.55 33.72 23.67 20.94 58.33 47.54 15.18
AlexNet [22] 61,101 20.91 59.93 40.97 27.80 19.06 18.67 46.11 44.09 12.57
DPN131 [36] 75,360 5.29 62.68 44.17 30.47 20.53 19.41 49.98 45.51 13.95
ResNext101 [37] 88,791 5.47 64.78 46.07 32.36 24.45 20.93 57.67 40.04 15.28
NASNetLarge [35] 88,950 3.8 63.60 44.66 30.16 19.93 19.73 51.34 45.49 14.00
SeNet154 [39] 115,089 4.47 64.23 45.94 32.54 22.62 20.81 58.45 46.83 15.05
PolyNet [40] 118,733 4.25 62.56 44.78 31.16 21.48 19.75 53.38 45.96 13.81
WideResNet101 [38] 126,886 5.72 63.47 45.37 31.71 21.73 19.84 54.27 46.23 14.51
VGG-11(bn) [14] 132,869 11.37 63.00 44.66 31.18 21.68 19.79 52.24 46.42 14.08
VGG-13(bn) [14] 133,054 10.75 63.64 45.09 31.26 21.41 20.25 55.17 46.35 14.64
VGG-16(bn) [14] 138,366 8.50 63.81 45.77 32.35 22.55 20.19 55.13 46.72 14.49
VGG-19(bn) [14] 143,678 9.12 62.57 44.63 30.97 21.44 19.76 54.10 46.23 14.44

• Also, the different variants of the ResNet [13], VGG [14] and DenseNet [15] architectures differ greatly in terms of Top-5 error on the Object Detection task when evaluated with the ImageNet dataset. However, that difference does not translate into a similar difference in performance on the image captioning task.

• For each image, most models generate reasonable captions, but there is a great variation in the caption sentences generated with different models. In some cases, captions generated with different models describe different portions of the image, and sometimes some models focus on a certain object in the image instead of providing a general overview of the scene.

• In some cases, models do not recognize certain objects in the image. In particular, we have observed many cases of incorrect gender identification, which points to a possible statistical bias in the dataset towards a particular gender in a certain context.

Thus we can conclude that the choice of CNN for the encoder significantly influences the performance of the model. In addition to the general observations, we are able to deduce

TABLE III. EXAMPLES OF GENERATED CAPTIONS BY THE CNN+LSTM METHOD USING DIFFERENT CNN ARCHITECTURES.

Choice of CNN | Image 1 | Image 2 | Image 3 | Image 4 | Image 5
ResNet-152 | a white crane flies over the water | a man riding a motorcycle | two young boys playing soccer | two children playing in a pool | a person riding a bike in the woods
Inception-ResNet | a white crane flies over the water | a man riding a motorcycle | two young boys playing soccer | two children are playing in a pool | a man in a blue shirt is riding a bike through a wooded area
NASNET-Large | a white crane flies over the water | a man is riding a red motorcycle | a boy in a red uniform kicks a soccer ball | a child plays in a pool | a man on a bike in a forest
VGG-16 | a white bird flies through the water | a man riding a yellow motorcycle | a boy in a soccer uniform kicking a soccer ball | a boy in a blue shirt plays with a plastic toy | a dirt bike rider is airborne in the woods
Alexnet | a white bird flies over the water | a man in yellow and yellow motorcycle | a boy in a red uniform runs with a soccer ball | a young girl in a bathing suit is jumping into a pool | a man is riding a bike on a dirt path
Squeezenet | a white bird in the water | a man in a yellow helmet is riding a bike | a boy in a red and white uniform is playing soccer | a little girl in a pink dress is playing in a pool | a person riding a bike through the woods
Densenet-201 | a white bird flies over the water | two bikers racing on the road | two children playing soccer | a young boy in a pool | a man on a bike is riding a bike through the woods
GoogLeNet | a white bird flies through the water | a man on a motorcycle is riding on a street | a young boy wearing a red shirt and a blue soccer ball | a little boy is being splashed in a pool | a man is riding a bike through the woods
Shufflenet | a white bird flies through the water | a man in a yellow helmet riding a yellow bike | a little boy in a red shirt is playing with a soccer ball | two young children playing in a fountain | a man in a blue helmet rides a bike through the woods
Mobilenet | a white bird is flying over water | a person riding a bike in a race | a boy in a red and white uniform is playing soccer | a young boy in a swimming pool | a person riding a dirt bike in the woods
Resnext-101 | a white bird flies over the water | a man on a motorcycle is riding a motorcycle | a soccer player in a red uniform kicks a soccer ball | a little girl is playing in a pool | a dirt bike rider in the woods
WideResNet-101 | a white bird flies over the water | a man riding a motorcycle | two boys playing soccer on a field | a boy is splashing in a pool | a person riding a dirt bike through the woods
Mnasnet | a white bird in the water | a man in a yellow jacket rides a motorcycle | a boy in a blue uniform is playing soccer | a little boy is playing in a pool | a man on a bike in the woods
Inception | a white bird flying over water | a man is riding a bike on a track | two boys playing soccer | two children play in a pool | a person in a blue shirt and blue jeans is sitting on a tree
DPN-131 | a white crane landing in the water | a person on a motorcycle | a young boy in a soccer uniform kicking a soccer ball | a little boy in a swimming pool | a person is riding a bike in the woods
Senet-154 | a white crane flying over water | a man is riding a yellow motorcycle | a man in a red uniform kicking a soccer ball | a little boy in a swimming pool | a person rides a bike through the woods
Polynet | a white bird flies over the water | a man rides a motorcycle | a boy in a blue uniform is chasing a soccer ball | a girl in a pink shirt is playing in a kiddie pool | a person rides a bike through the woods

the following specific observations about the choice of CNN:

• The ResNet [13] and DenseNet [15] CNN architectures are well suited to Image Caption Generation and generate better results while having a lower model complexity than other architectures.

V. CONCLUSION

In this work, we have evaluated encoder-decoder and attention based caption generation frameworks with different choices of CNN encoders and observed that there is a wide variation in terms of both the scores, as evaluated with commonly used metrics (BLEU, METEOR, CIDEr, SPICE, ROUGE-L), and the generated captions when using different CNN encoders. In terms of most metrics, there is a difference in performance of around 4-5 points between the worst and best performing models. Hence, the choice of a particular CNN architecture plays a big role in the image caption generation process. In particular, ResNet and DenseNet based CNN architectures lead to better overall performance while at the same time using fewer parameters than other models.

Also, since there is a great variation in the generated captions for each image, it may be possible to use an ensemble of models, each of which utilizes a different CNN as encoder, to increase the diversity of the generated captions; such model ensembling could also lead to better performance. In the works proposed in the literature, model ensembling has been used, such as in [6], but such ensembles utilize similar models trained with different hyperparameters. Using ensembles of models which use different CNN encoders is an area which could be explored in future works.

Furthermore, we hope that this analysis of the effect of the choice of different CNNs for image captioning will aid researchers in better selection of CNN architectures to be used as encoders for image feature extraction in Image Caption Generation.
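As a rough illustration of the ensembling idea discussed above (combining decoders built on different CNN encoders), the sketch below averages the per-step word distributions of several independently trained captioning models; the model interface shown here is hypothetical and not part of our released code.

```python
import torch

def ensemble_step(models, inputs):
    """Average the next-word distributions of several captioning models.

    `models` is a list of hypothetical captioning models, each returning
    unnormalized word scores for the next time-step given `inputs`
    (e.g. the image features and the words generated so far).
    """
    probs = [torch.softmax(m(inputs), dim=-1) for m in models]
    avg = torch.stack(probs, dim=0).mean(dim=0)    # mean of the word distributions
    return avg.argmax(dim=-1)                      # greedy choice of the next word
```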

TABLE IV. EXAMPLES OF GENERATED CAPTIONS BY THE CNN+LSTM+ATTENTION METHOD USING DIFFERENT CNN ARCHITECTURES.

Choice of CNN | Image 1 | Image 2 | Image 3 | Image 4 | Image 5
ResNet-152 | a man is standing in front of a mountain | a dog runs through the snow | a dog jumps over a hurdle | a man and a woman are sitting on a fountain | a young girl in a pink bathing suit is playing in the water
Inception-ResNet | a man with a backpack stands on a mountaintop | a man in a red jacket is skiing down a snowy hill | a brown dog jumps over a hurdle | two children playing in a fountain | a little girl plays in the water
NASNET-Large | a man sits on top of a mountain | a dog is running through the snow | a brown dog is jumping over a hurdle | a group of people are playing in a fountain | a girl in a pink swimsuit is jumping into the water
VGG-16 | a man is standing on top of a mountaintop | a brown dog is standing in the snow | a dog is jumping over a hurdle | a group of people are sitting on a ledge overlooking a city | a woman in a swimsuit is standing in the water
Alexnet | a man is standing on top of a mountain | a brown dog is running through snow | a dog jumps over a hurdle | a man in a black jacket is standing next to a building | a boy in a pool
Squeezenet | a group of people sit on a snowy mountain | a man in a red jacket is standing on a snowy hill | a brown and white dog with a red and white dog | a group of people stand in front of a building | a woman in a white shirt is walking through the water
Densenet-201 | a man in a blue shirt is standing in the mountains | a brown dog is jumping in the snow | a dog jumps over a hurdle | a man and a woman are standing in front of a fountain | a young girl jumping into the water
GoogLeNet | a man is standing on a mountaintop | a black and white dog is running through the snow | a dog jumps over a hurdle | a group of people stand in a fountain | a young boy plays in the water
Shufflenet | a man and a woman are sitting on a rock overlooking the mountains | a man in a red jacket is standing on a snowy hill | a woman and a dog are playing in a yard | a man and a woman are walking down a city street | a man is standing on the shore of a body of water
Mobilenet | a man stands on a mountain | a man is skiing down a snowy hill | a woman and a woman sitting on a bench | two men are standing next to a fountain | a girl in the water
Resnext-101 | a man with a backpack stands on a mountaintop | a person is skiing down a snowy hill | a dog jumping over a hurdle | a man and a woman are standing in a fountain | a woman in a bikini is playing in the water
WideResNet-101 | a man is standing on top of a mountain | a dog is running through the snow | a man and a dog on a leash | a group of people are standing in a fountain | a woman in a bathing suit walks along the water
Mnasnet | a man and a woman are standing in the mountains | a brown dog is running through the snow | a dog jumps over a hurdle | a group of people are standing in front of a fountain | a boy is splashing in the water
Inception | a man stands on a rock overlooking the mountains | a black and white dog in the snow | a brown and white dog is jumping over a hurdle | a group of people are playing in a fountain | a dog walks through the water
DPN-131 | a man is standing on top of a mountain | a man and a dog play in the snow | a dog jumps over a hurdle | a group of people stand in a fountain | a girl in a swimsuit is jumping into the water
Senet-154 | a man is standing in front of a mountain | a dog is running through the snow | a dog jumps over a hurdle | a man is standing in front of a fountain | a girl in a red bathing suit splashes in the water
Polynet | a man stands on a mountaintop | a dog is jumping over a snowy hill | a dog is jumping over a hurdle | a group of people are standing in front of a fountain fountain | a woman in a bathing suit is standing in front of a waterfall

ACKNOWLEDGMENT

We are greatly indebted to the MultiMedia Processing and Language Processing Laboratories at the Department of Computer Science and Engineering, National Institute of Technology, Silchar, India, for providing us with the GPU-equipped workstations which were indispensable for this work. The Office of the Head of Department, Department of Computer Science and Engineering at the National Institute of Technology, Silchar, also provided one GPU-equipped workstation for this work, for which we are greatly obliged.

This work was not supported by any financial grant and there do not exist any conflicts of interest.

REFERENCES

[1] Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale hierarchical image database." In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.
[2] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." In Advances in Neural Information Processing Systems, pp. 3104-3112. 2014.
[3] Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
[4] Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128-3137. 2015.
[5] Mao, Junhua, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. "Deep captioning with multimodal recurrent neural networks (m-rnn)." arXiv preprint arXiv:1412.6632 (2014).
[6] Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge." IEEE Transactions on Pattern Analysis and Machine Intelligence 39, no. 4 (2016): 652-663.
[7] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
[8] Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. "Show, attend and tell: Neural image caption generation with visual attention." In International Conference on Machine Learning, pp. 2048-2057. 2015.
[9] Bernardi, Raffaella, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. "Automatic description generation from images: A survey of models, datasets, and evaluation measures." Journal of Artificial Intelligence Research 55 (2016): 409-442.


[10] Hossain, MD Zakir, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. "A comprehensive survey of deep learning for image captioning." ACM Computing Surveys (CSUR) 51, no. 6 (2019): 1-36.
[11] LeCun, Yann, Bernhard Boser, John Denker, Donnie Henderson, R. Howard, Wayne Hubbard, and Lawrence Jackel. "Handwritten digit recognition with a back-propagation network." Advances in Neural Information Processing Systems 2 (1989): 396-404.
[12] Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252.
[13] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016.
[14] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[15] Huang, Gao, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. "Densely connected convolutional networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708. 2017.
[16] Kojima, Atsuhiro, Takeshi Tamura, and Kunio Fukunaga. "Natural language description of human activities from video images based on concept hierarchy of actions." International Journal of Computer Vision 50, no. 2 (2002): 171-184.
[17] Farhadi, Ali, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. "Every picture tells a story: Generating sentences from images." In European Conference on Computer Vision, pp. 15-29. Springer, Berlin, Heidelberg, 2010.
[18] Mason, Rebecca, and Eugene Charniak. "Nonparametric method for data-driven image captioning." In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 592-598. 2014.
[19] Kulkarni, Girish, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. "Babytalk: Understanding and generating simple image descriptions." IEEE Transactions on Pattern Analysis and Machine Intelligence 35, no. 12 (2013): 2891-2903.
[20] Li, Siming, Girish Kulkarni, Tamara Berg, Alexander Berg, and Yejin Choi. "Composing simple image descriptions using web-scale n-grams." In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp. 220-228. 2011.
[21] Elman, Jeffrey L. "Finding structure in time." Cognitive Science 14, no. 2 (1990): 179-211.
[22] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." In NIPS, pp. 1097-1105. 2012.
[23] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[24] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
[25] Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017).
[26] Aneja, Jyoti, Aditya Deshpande, and Alexander G. Schwing. "Convolutional image captioning." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5561-5570. 2018.
[27] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 5998-6008. 2017.
[28] Yu, Jun, Jing Li, Zhou Yu, and Qingming Huang. "Multimodal transformer with multi-view visual representation for image captioning." IEEE Transactions on Circuits and Systems for Video Technology (2019).
[29] Szegedy, C., W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).
[30] Young, Peter, Alice Lai, Micah Hodosh, and Julia Hockenmaier. "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions." Transactions of the Association for Computational Linguistics 2 (2014): 67-78.
[31] Iandola, Forrest N., Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint arXiv:1602.07360 (2016).
[32] Ma, Ningning, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. "Shufflenet v2: Practical guidelines for efficient CNN architecture design." In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116-131. 2018.
[33] Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. "Mobilenetv2: Inverted residuals and linear bottlenecks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520. 2018.
[34] Tan, Mingxing, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. "Mnasnet: Platform-aware neural architecture search for mobile." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820-2828. 2019.
[35] Zoph, Barret, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. "Learning transferable architectures for scalable image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697-8710. 2018.
[36] Chen, Yunpeng, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. "Dual path networks." In Advances in Neural Information Processing Systems, pp. 4467-4475. 2017.
[37] Xie, Saining, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. "Aggregated residual transformations for deep neural networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492-1500. 2017.
[38] Zagoruyko, Sergey, and Nikos Komodakis. "Wide residual networks." arXiv preprint arXiv:1605.07146 (2016).
[39] Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141. 2018.
[40] Zhang, Xingcheng, Zhizhong Li, Chen Change Loy, and Dahua Lin. "Polynet: A pursuit of structural diversity in very deep networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 718-726. 2017.
[41] Szegedy, Christian, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).
[42] Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. "Microsoft COCO: Common objects in context." In European Conference on Computer Vision, pp. 740-755. Springer, Cham, 2014.

