
Volume 9, Issue 8, August – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24AUG851

Image Caption Generator Using CNN and LSTM
Monali Kapuriya1; Zemi Lakkad2; Satwi Shah3
Nirma University

Abstract:- In this study, we explore the integration of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks for the purpose of image caption generation, a task that requires a fusion of natural language processing and computer vision techniques to describe images in English. Delving into the realm of image captioning, we investigate several fundamental concepts and methodologies associated with this area. Our approach leverages prominent tools such as the Keras library, NumPy, and Jupyter notebooks to facilitate the development of our research. Furthermore, we examine the use of the flickr_dataset and CNNs for image classification, elucidating their significance in our study. Through this research endeavor, we aim to contribute to the development of image captioning systems by combining state-of-the-art techniques from both the computer vision and natural language processing domains.

Keywords:- CNN, LSTM, Image Captioning, Deep Learning.

I. INTRODUCTION

Every day we see numerous photos in our surroundings, on social media, and in the newspapers. Humans can comprehend pictures on their own with ease: we can make sense of images without their accompanying captions. Machines, on the other hand, must be trained first before they can generate image captions automatically. Image captioning can serve many purposes, for example assisting visually impaired people through text-to-speech that gives real-time feedback about the surrounding scene from a camera feed, enriching social media by converting the captions of photographs in social feeds into spoken messages, and helping children recognize objects as well as learn the language. Captions for every picture on the world wide web could also be produced faster.

II. LITERATURE REVIEW

In this section, we discuss the three principal categories of existing image captioning techniques: template-based image captioning, retrieval-based image captioning, and novel caption generation. Template-based methods have fixed templates with blank slots to generate captions. In these systems, the distinct objects, actions, and attributes are first identified, and then the gaps in the templates are filled. For instance, Farhadi et al. [1] use three specific elements of a scene to fill the template slots for producing image captions. A Conditional Random Field (CRF) is leveraged by Kulkarni et al. [2] to detect the objects, attributes, and prepositions before filling in the blanks. Template-based approaches can generate grammatically accurate captions, but since the templates are predefined, they cannot generate variable-length captions.

 Proposed Work:
CNN- A Convolutional Neural Network is a specially structured neural network designed to process information arranged in 2D grids, making it well suited to the analysis of images. These networks scan images in an ordered way, extract meaningful features, and combine them to characterize the content they perceive. While analyzing, CNNs can recognize images under diverse transformations, including translations, rotations, scaling, and distortions, with minimal preprocessing compared to traditional approaches that rely on hand-crafted filters. The architecture is inspired by the human visual system, whose highly organized visual cortex arranges neurons into columnar patterns, permitting individual neurons to react rapidly to stimuli in particular receptive fields and guaranteeing broad coverage of the visual scene. To summarize, in our work on computer vision in image processing, CNNs use convolutional layers to identify edges, textures, and other visual parts, and pooling layers to condense spatial information. This architecture allows the network to learn progressively richer representations as information flows through it. At the end of the network, fully connected layers combine the extracted data to produce the classification. The network is predominantly trained using supervised learning and then adapted via transfer learning.


Using CNNs for image classification has many practical benefits; among related tasks, CNNs are heavily used for object identification and segmentation, and they have even been applied in domains beyond vision such as natural language processing and speech synthesis.

 CNN Architecture:
For examining large images and videos, the traditional neural network layout, in which every neuron in one layer connects to every neuron in the next, is inefficient. Standard-sized images are high resolution and carry grayscale or RGB color channels, so they are large, and with numerous such pictures the number of parameters becomes excessive, which leads to overfitting. A Convolutional Neural Network instead arranges its neurons in 3D, in groups that evaluate smaller sections, or "features", of the image. Each neuron cluster specializes in recognizing a particular part of the image, such as a nose, ear, mouth, or leg, before passing its output to the next layer. The ultimate output is a map that shows the relevance of each individual feature to the whole classification.
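To put this inefficiency into numbers (an illustrative calculation of ours, not a figure from the paper): a 224 × 224 RGB image flattened for a fully connected layer has 224 × 224 × 3 = 150,528 inputs, so a single dense layer of only 1,000 neurons already requires over 150 million weights, whereas one 3 × 3 convolutional filter spanning all three channels needs just 28 parameters (27 weights plus a bias) regardless of the image size.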

Fig 1 CNN Architecture

 How does CNN Work?
As already mentioned, a fully connected neural network, where all inputs in one layer are connected to all inputs in the following layer, is suitable for some tasks. In a CNN, by contrast, neurons within a layer connect only to a neighborhood of cells instead of binding to all of them uniformly. As a result, the network becomes less complex and less computationally expensive. In the context of image processing, two images can be compared by checking them pixel by pixel. This algorithm works perfectly well when one wants to compare identical images; however, the comparison falls apart the moment the images differ even slightly. A CNN, however, performs the comparison piece by piece. The primary advantage of the CNN approach lies in its ability to take pictures as input and generate a feature map based on similarities and variations between input images. A CNN effectively classifies pixels, generating a matrix called a feature map in which similar pixels are grouped together. These feature maps are instrumental in extracting vital information from input images.
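As a minimal sketch of how such a feature map is produced (our own illustration in plain NumPy, not code from the paper; the vertical-edge kernel is a hypothetical example), a convolution slides a small filter across the image and records how strongly each local patch matches it:

    import numpy as np

    def conv2d(image, kernel):
        # Slide the kernel over the image (valid padding) and score each patch.
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        fmap = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                # High values mark patches that resemble the filter's pattern.
                fmap[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return fmap

    # Hypothetical vertical-edge filter applied to a random stand-in "image".
    edge_kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
    image = np.random.rand(28, 28)
    print(conv2d(image, edge_kernel).shape)  # -> (26, 26) feature map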

Fig 2 How CNN Works


To develop a CNN, three types of layers must be designed: Convolutional, Pooling, and Fully Connected. In the first convolutional layer, the image input is processed to generate a feature map, which acts as input to subsequent layers such as the pooling layer. The features in this map are simpler segments of the image that make it easier to understand. Pooling then creates a denser version of the map that holds the important details of the picture. To reach an optimal density for each image, the convolutional and pooling layers are repeated many times. The final stage sorts pixels according to their similarities or differences in order to facilitate classification.
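A minimal Keras sketch of this Convolutional → Pooling → Fully Connected stack might look as follows (our illustration; the input size, layer widths, and number of classes are assumptions rather than values from the paper):

    from tensorflow.keras import layers, models

    # Convolution + pooling blocks extract and condense features;
    # the dense head performs the final classification.
    model = models.Sequential([
        layers.Input(shape=(128, 128, 3)),       # assumed RGB input size
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation="softmax"),  # assumed number of classes
    ])
    model.summary()

Repeating the convolution-pooling pair, as described above, is what progressively condenses the feature map before classification.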

Fig 3 Mathematical Process in CNN

Considerable effort goes into this classification process, which aims at extracting the essential details from a picture, leading towards the identification of objects, people, and other elements present in most pictures. These layers enable CNNs to locate and extract features from images, turning flexible-length inputs into fixed-size outputs. The widespread application of CNN techniques points to their usefulness and relevance in many different areas.

 Origin of LSTM:
LSTM, short for Long Short-Term Memory, was first proposed by the German researchers Sepp Hochreiter and Jürgen Schmidhuber in 1997. Within the area of recurrent neural networks in deep learning, LSTM plays a pivotal role. What distinguishes LSTM is its ability not only to store input information but also to generate predictions for subsequent data points autonomously. This characteristic permits LSTM networks to retain information for a designated duration and use it to forecast or infer future values. Consequently, LSTM is favored over traditional RNNs for tasks requiring both memory and prediction capabilities.

Fig 4 RNNs for Sequential Data

 The Problem with RNNs (Recurrent Neural Networks):
RNNs, central to deep learning methodologies, excel at complex computational tasks such as object classification and speech recognition. They are particularly adept at handling sequential data, where each step's information depends on information from previous steps. Ideally, we use RNNs with large datasets and strong capacity, and such RNNs find practical applications in tasks like stock forecasting and advanced speech recognition. However, their usefulness on real-world problems is constrained by the Vanishing Gradient problem.

 Vanishing Gradient Problem:
The vanishing gradient problem poses a serious threat to the effectiveness of RNNs. Typically, RNNs are designed to hold information for short periods and operate most efficiently over a constrained window of data. They struggle to remember all the values over prolonged durations.


Therefore, the memory capability of RNNs is better suited to shorter data sequences and short timeframes. This problem becomes especially noticeable for conventional RNNs when solving tasks involving many time steps. As the number of time steps grows, RNNs encounter problems in preserving and processing information via backpropagation. The need to retain values from every time step results in an exponential increase in memory requirements, rendering it impractical for RNNs. This leads to the emergence of the vanishing gradient problem, impeding the network's ability to effectively learn and generalize from data.

 What can be done to solve this Vanishing Gradient Problem with RNNs:
To cope with the vanishing gradient problem, Long Short-Term Memory (LSTM), a subtype of RNN, is applied. LSTMs are specifically designed to overcome this challenge by retaining values for prolonged periods, effectively mitigating the vanishing gradient problem. Unlike conventional RNNs, LSTMs are structured to continuously learn from mistakes, allowing them to store and process information across multiple time steps. This iterative learning process enables simpler backpropagation through time and layers.

LSTMs employ multiple gates to govern information, processing it before passing it to the final gate for output. This contrasts with RNNs, which transmit information directly to the final gate without intermediate processing. The gates inside LSTM networks permit versatile data manipulation, including storage and retrieval, with every gate independently able to make decisions based on the input. Additionally, these gates can autonomously adjust how far they open or close, contributing to the network's adaptability and effectiveness in learning and retaining information.

 Architecture of LSTM:
The structure of a Long Short-Term Memory (LSTM) network includes several key components:

Fig 5 LSTM Architecture

 Forget Gate: This gate decides which information from the previous state should be discarded or forgotten. It takes as input the previous hidden state ℎ𝑡−1 and the current input 𝑥𝑡, and produces a forget vector 𝑓𝑡.

 Input Gate: The input gate determines which new information should be stored in the cell state. It comprises two elements: a sigmoid layer that decides which values will be updated, and a tanh layer that creates a vector of new candidate values 𝐶̃𝑡 that could be added to the state.

 Cell State Update: The cell state 𝐶𝑡 is updated by first forgetting irrelevant information (using the forget gate) and then adding new information (using the input gate).

 Output Gate: The output gate controls what information from the cell state should be exposed to the output. It decides the next hidden state ℎ𝑡 based on the current input 𝑥𝑡, the previous hidden state ℎ𝑡−1, and the updated cell state 𝐶𝑡.
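For reference, these four steps correspond to the standard LSTM update equations (the classical Hochreiter-Schmidhuber formulation, stated here for completeness since the paper does not write them out):

    f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
    i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
    \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
    C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
    o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
    h_t = o_t \odot \tanh(C_t)

where \sigma is the logistic sigmoid, \odot denotes element-wise multiplication, and [h_{t-1}, x_t] is the concatenation of the previous hidden state with the current input.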


Fig 6 Gates in LSTM Architecture

LSTMs, a subset of RNNs, have a greater capacity to retain information than conventional RNNs and are widely used across various fields today. The basic structure of an LSTM consists of three primary gates: the Forget gate, Input gate, and Output gate. These gates are responsible for storing data and producing the desired output, and they are mentioned whenever LSTM networks are discussed.

 Use of LSTM Network:
LSTMs are applied to a wide selection of deep learning tasks, primarily centered on forecasting future data from past data. Two prominent examples are text prediction and stock market prediction.

 Text Prediction: LSTMs are notably effective at predicting text sequences. Their long-term memory allows them to anticipate the subsequent words in a sentence. This is accomplished through the LSTM network's ability to internally store information about word meanings, patterns, and contextual usage, permitting it to generate accurate predictions. Text prediction applications, such as the chatbots commonly employed on e-commerce websites and in mobile apps, exemplify the practical utility of LSTM in this area.

 Stock Market Prediction: LSTMs are also employed to forecast stock market trends by studying historical market data. Predicting market fluctuations is inherently challenging because of the complex and unpredictable nature of financial markets. However, LSTM models can leverage stored information on past market behavior to anticipate future variations and trends. Achieving accurate predictions in this area requires extensive training of the LSTM model using large datasets spanning extended durations.

 Image Caption Generation Model:
We combine CNN and LSTM architectures into a unified CNN-LSTM model to create an image caption generator. First, a pre-trained Xception CNN extracts vital features from the input image: visual characteristics and information key to understanding the image's content. Next, the LSTM processes those extracted features to generate coherent, descriptive captions. By leveraging the strengths of CNNs for visual data and LSTMs for text generation, the model effectively translates visual content into accurate, meaningful textual descriptions.
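A minimal Keras sketch of this CNN-LSTM combination is given below (our illustration only: the vocabulary size, caption length, and layer widths are assumed placeholders, and this merge-style decoder is one common way to realize the described design, not necessarily the paper's exact architecture):

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import Xception

    vocab_size, max_len, feat_dim = 8000, 34, 2048   # assumed sizes

    # 1) Encoder: pre-trained Xception turns each image into a 2048-d feature
    #    vector; features are precomputed once with cnn.predict(images).
    cnn = Xception(weights="imagenet", include_top=False, pooling="avg")

    # 2) Decoder: image features plus the caption so far predict the next word.
    img_in = layers.Input(shape=(feat_dim,))
    img_vec = layers.Dense(256, activation="relu")(layers.Dropout(0.5)(img_in))

    txt_in = layers.Input(shape=(max_len,))
    txt_emb = layers.Embedding(vocab_size, 256, mask_zero=True)(txt_in)
    txt_vec = layers.LSTM(256)(layers.Dropout(0.5)(txt_emb))

    merged = layers.add([img_vec, txt_vec])
    hidden = layers.Dense(256, activation="relu")(merged)
    out = layers.Dense(vocab_size, activation="softmax")(hidden)

    caption_model = models.Model(inputs=[img_in, txt_in], outputs=out)
    caption_model.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time the decoder runs word by word: the caption generated so far is fed back through txt_in until an end-of-sequence token is produced.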

Fig 7 Image Captioning Model


 Implementation:
First, the Flickr8k dataset—a renowned reference for image captioning tasks—is selected for the study. Data pre-processing includes splitting the dataset into training, validation, and test sets, along with caption normalization for consistency. After the images are converted into numerical arrays, a pre-trained Inception v3 model is used to extract high-level features from them. The model architecture follows an encoder-decoder design, where the decoder is made up of an LSTM layer and an embedding layer, and the encoder contains normalization and dense layers.

III. RESULT

This visualization shows the distribution of word occurrences in the generated captions. It provides insight into the diversity and frequency of the words used, indicating the richness of the model's vocabulary. A balanced distribution represents linguistic diversity, while a skewed distribution may point to biases or constraints in the model.
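A distribution like this can be computed in a few lines of Python (a hypothetical sketch: the generated_captions list stands in for the model's real output, which is not reproduced here as data):

    from collections import Counter

    # Stand-in captions; the real ones come from the trained model.
    generated_captions = [
        "a dog runs through the grass",
        "a man rides a bike on the street",
        "a dog plays with a ball",
    ]

    # Count how often each word appears across all generated captions.
    word_counts = Counter(w for cap in generated_captions for w in cap.split())
    print(word_counts.most_common(5))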

Fig 8 Donut Chart for Word Occurrence

The histogram provides insight into the descriptive depth of the generated captions by showing the distribution of caption lengths across the data set. A wide range of lengths demonstrates the model's adaptability to images of varying complexity, while clusters of lengths can indicate a tendency toward verbosity or concision, which may motivate adjustments for optimal text length.
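Such a histogram can be drawn directly from per-caption word counts (again a hypothetical sketch with made-up lengths rather than the model's actual output):

    import matplotlib.pyplot as plt

    # Stand-in caption lengths in words; real values come from the model output.
    lengths = [6, 8, 7, 10, 6, 9, 12, 7, 8, 11]
    plt.hist(lengths, bins=range(1, 20))
    plt.xlabel("Caption length (words)")
    plt.ylabel("Number of captions")
    plt.show()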

Fig 9 Histogram of Lengths of Captions Generated by the Model


These captions provide a qualitative assessment of the model's performance in image understanding and caption generation. The model's interpretation of the visual content is reflected in each of the captions, demonstrating its ability to identify features and contextualize them into a coherent narrative.

Fig 10 Qualitative Assessment of the Model's Performance in Image Understanding and Caption Generation

Consistency between visual content and textual descriptions indicates proficiency in image semantics comprehension, guiding enhancements for improved captioning accuracy.

IV. CONCLUSION

In conclusion, this paper has explored numerous deep learning-based techniques for image captioning, categorizing them, presenting a general block diagram of their fundamental groupings, and comparing their advantages and drawbacks. We have also examined the metrics and datasets used, along with a short summary of experimental findings. While substantial progress has been made in deep learning-based image captioning systems, achieving robust captioning techniques capable of generating high-quality captions for every image remains a challenge. With the continual introduction of new deep learning network designs, automatic captioning will remain a prominent area of research for the foreseeable future. As the number of social media users continues to rise, with many of them sharing pictures, the demand for captions is expected to grow. Therefore, initiatives in this area hold significant potential for benefiting a growing audience of social media users.

REFERENCES

[1]. Abhaya Agarwal and Alon Lavie. 2008. Meteor, m-bleu and m-ter: Evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation. Association for Computational Linguistics, 115–118.
[2]. Ahmet Aker and Robert Gaizauskas. 2010. Generating image descriptions using dependency relational patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1250–1258.
[3]. Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision. Springer, 382–398.
[4]. Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998 (2017).
[5]. Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing. 2018. Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5561–5570.


[6]. Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell, Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, et al. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[7]. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).
[8]. Shuang Bai and Shan An. 2018. A Survey on Automatic Image Caption Generation. Neurocomputing.
[9]. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Vol. 29. 65–72.
